Abstract

We derive a second-order ordinary differential equation (ODE) which is the limit of Nesterov’s
accelerated gradient method. This ODE exhibits approximate equivalence to Nesterov’s
scheme and thus can serve as a tool for analysis. We show that the continuous time
ODE allows for a better understanding of Nesterov’s scheme. As a byproduct, we obtain
a family of schemes with similar convergence rates. The ODE interpretation also suggests
restarting Nesterov’s scheme, leading to an algorithm which can be rigorously proven to
converge at a linear rate whenever the objective is strongly convex.

Keywords: Nesterov’s accelerated scheme, convex optimization, first-order methods,
differential equation, restarting

1. Introduction 

In many fields of machine learning, minimizing a convex function is at the core of efficient
model estimation. In the simplest and most standard form, we are interested in solving

$$\text{minimize} \quad f(x),$$

where $f$ is a convex function, smooth or non-smooth, and $x \in \mathbb{R}^n$ is the variable. Since
Newton, numerous algorithms and methods have been proposed to solve the minimization
problem, notably gradient and subgradient descent, Newton’s methods, trust region methods,
conjugate gradient methods, and interior point methods (see e.g. Polyak, 1987; Boyd
and Vandenberghe, 2004; Nocedal and Wright, 2006; Ruszczyński, 2006; Boyd et al., 2011;
Shor, 2012; Beck, 2014, for expositions).

First-order methods have regained popularity as data sets and problems are ever increasing
in size and, consequently, there has been much research on the theory and practice
of accelerated first-order schemes. Perhaps the earliest first-order method for minimizing
a convex function $f$ is the gradient method, which dates back to Euler and Lagrange.
Thirty years ago, however, in a seminal paper Nesterov proposed an accelerated gradient
method (Nesterov, 1983), which may take the following form: starting with $x_0$ and $y_0 = x_0$,
inductively define

$$x_k = y_{k-1} - s\nabla f(y_{k-1}), \qquad y_k = x_k + \frac{k-1}{k+2}(x_k - x_{k-1}). \tag{1}$$

For any fixed step size $s \le 1/L$, where $L$ is the Lipschitz constant of $\nabla f$, this scheme
exhibits the convergence rate

$$f(x_k) - f^* \le O\left(\frac{\|x_0 - x^*\|^2}{sk^2}\right). \tag{2}$$

Above, $x^*$ is any minimizer of $f$ and $f^* = f(x^*)$. It is well-known that this rate is optimal
among all methods having only information about the gradient of $f$ at consecutive
iterates (Nesterov, 2004). This is in contrast to vanilla gradient descent methods, which
have the same computational complexity but can only achieve a rate of $O(1/k)$. This
improvement relies on the introduction of the momentum term $x_k - x_{k-1}$ as well as the
particularly tuned coefficient $(k-1)/(k+2) \approx 1 - 3/k$. Since the introduction of Nesterov’s
scheme, there has been much work on the development of first-order accelerated methods;
see Nesterov (2004, 2005, 2013) for theoretical developments, and Tseng (2008) for a unified
analysis of these ideas. Notable applications can be found in sparse linear regression (Beck
and Teboulle, 2009; Qin and Goldfarb, 2012), compressed sensing (Becker et al., 2011), and
deep and recurrent neural networks (Sutskever et al., 2013).
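To make the contrast with gradient descent concrete, here is a minimal Python sketch of scheme (1) next to plain gradient descent. The test problem (a two-dimensional quadratic with eigenvalues 1 and 1e-4), the step size s = 1/L = 1, the iteration count and all function names are illustrative assumptions and do not come from the paper. The bound it prints is the explicit constant quoted later in Section 3.1.

```python
def f(x, lams):
    """Objective f(x) = 0.5 * sum(lam_i * x_i^2); minimizer x* = 0, f* = 0."""
    return 0.5 * sum(lam * xi * xi for lam, xi in zip(lams, x))


def grad(x, lams):
    """Gradient of the quadratic above."""
    return [lam * xi for lam, xi in zip(lams, x)]


def nesterov(x0, lams, s, num_iters):
    """Scheme (1): gradient step from y_{k-1}, then momentum (k-1)/(k+2)."""
    x_prev, y = list(x0), list(x0)
    for k in range(1, num_iters + 1):
        x = [yi - s * gi for yi, gi in zip(y, grad(y, lams))]
        y = [xi + (k - 1) / (k + 2) * (xi - xpi) for xi, xpi in zip(x, x_prev)]
        x_prev = x
    return x_prev  # this is x_k with k = num_iters


def gradient_descent(x0, lams, s, num_iters):
    """Plain gradient descent with the same step size."""
    x = list(x0)
    for _ in range(num_iters):
        x = [xi - s * gi for xi, gi in zip(x, grad(x, lams))]
    return x


lams = [1.0, 1e-4]          # assumed eigenvalues, so L = 1
x0 = [1.0, 1.0]
s, k = 1.0, 300             # s = 1/L
nesterov_gap = f(nesterov(x0, lams, s, k), lams)
gd_gap = f(gradient_descent(x0, lams, s, k), lams)
bound = 2 * sum(xi * xi for xi in x0) / (s * (k + 1) ** 2)
print("Nesterov gap:", nesterov_gap)
print("gradient descent gap:", gd_gap)
print("bound 2||x0 - x*||^2 / (s (k+1)^2):", bound)
```

On this ill-conditioned example the accelerated iterate should sit well below both the gradient descent gap and the bound; other problems and step sizes may show a smaller difference.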

In a different direction, there is a long history relating ordinary differential equations
(ODEs) to optimization; see Helmke and Moore (1996), Schropp and Singer (2000), and
Fiori (2005) for example. The connection between ODEs and numerical optimization is often
established via taking step sizes to be very small so that the trajectory or solution path
converges to a curve modeled by an ODE. The conciseness and well-established theory of
ODEs provide deeper insights into optimization, which has led to many interesting findings.
Notable examples include linear regression via solving differential equations induced by the
linearized Bregman iteration algorithm (Osher et al., 2014), a continuous-time Nesterov-like
algorithm in the context of control design (Dürr and Ebenbauer, 2012; Dürr et al., 2012), and
modeling iterative optimization algorithms as nonlinear dynamical systems (Lessard
et al., 2014).

In this work, we derive a second-order ODE which is the exact limit of Nesterov’s
scheme by taking small step sizes in (1); to the best of our knowledge, this work is the first
to use ODEs to model Nesterov’s scheme or its variants in this limit. One surprising fact
in connection with this subject is that a first-order scheme is modeled by a second-order
ODE. This ODE takes the following form:

$$\ddot{X} + \frac{3}{t}\dot{X} + \nabla f(X) = 0 \tag{3}$$

for $t > 0$, with initial conditions $X(0) = x_0$, $\dot{X}(0) = 0$; here, $x_0$ is the starting point
in Nesterov’s scheme, $\dot{X} = dX/dt$ denotes the time derivative or velocity and similarly
$\ddot{X} = d^2X/dt^2$ denotes the acceleration. The time parameter in this ODE is related to the
step size in (1) via $t \approx k\sqrt{s}$. Expectedly, it also enjoys an inverse quadratic convergence rate
as its discrete analog,

$$f(X(t)) - f^* \le O\left(\frac{\|x_0 - x^*\|^2}{t^2}\right).$$

Approximate equivalence between Nesterov’s scheme and the ODE is established later in 
various perspectives, rigorous and intuitive. In the main body of this paper, examples and 
case studies are provided to demonstrate that the homogeneous and conceptually simpler 
ODE can serve as a tool for understanding, analyzing and generalizing Nesterov’s scheme. 

In the following, two insights into Nesterov’s scheme are highlighted, the first one on
oscillations in the trajectories of this scheme, and the second on the peculiar constant 3
appearing in the ODE.

1.1 From Overdamping to Underdamping 

In general, Nesterov’s scheme is not monotone in the objective function value due to the
introduction of the momentum term. Oscillations or overshoots along the trajectory of
iterates approaching the minimizer are often observed when running Nesterov’s scheme.
Figure 1 presents typical phenomena of this kind, where a two-dimensional convex function
is minimized by Nesterov’s scheme. Viewing the ODE as a damping system, we obtain
interpretations as follows.

Small t. In the beginning, the damping ratio $3/t$ is large. This leads the ODE to be an
overdamped system, returning to the equilibrium without oscillating;

Large t. As $t$ increases, the ODE with a small $3/t$ behaves like an underdamped system,
oscillating with the amplitude gradually decreasing to zero.

As depicted in Figure 1a, in the beginning the ODE curve moves smoothly towards the
origin, the minimizer $x^*$. The second interpretation “Large t” provides a partial explanation
for the oscillations observed in Nesterov’s scheme at later stages. Although our analysis
extends farther, it is similar in spirit to that carried out in O’Donoghue and Candès (2013).
In particular, the zoomed Figure 1b presents some butterfly-like oscillations for both the
scheme and ODE. There, we see that the trajectory constantly moves away from the origin
and returns later. Each overshoot in Figure 1b causes a bump in the function values,
as shown in Figure 1c. We observe also from Figure 1c that the periodicity captured by the
bumps is very close to that of the ODE solution. In passing, it is worth mentioning that
the solution to the ODE in this case can be expressed via Bessel functions, hence enabling
quantitative characterizations of these overshoots and bumps, which are given in full detail
in Section 3.

1.2 A Phase Transition 

The constant 3, derived from $(k+2) - (k-1)$ in (3), is not haphazard. In fact, it is the
smallest constant that guarantees $O(1/t^2)$ convergence rate. Specifically, parameterized by
a constant $r$, the generalized ODE

$$\ddot{X} + \frac{r}{t}\dot{X} + \nabla f(X) = 0$$

can be translated into a generalized Nesterov’s scheme that is the same as the original
(1) except for $(k-1)/(k+2)$ being replaced by $(k-1)/(k+r-1)$. Surprisingly, for
both generalized ODEs and schemes, the inverse quadratic convergence is guaranteed if and
only if $r \ge 3$. This phase transition suggests there might be deep causes for acceleration
among first-order methods. In particular, for $r \ge 3$, the worst case constant in this inverse
quadratic convergence rate is minimized at $r = 3$.

Figure 1: Minimizing $f = 2 \times 10^{-2} x_1^2 + 5 \times 10^{-3} x_2^2$, starting from $x_0 = (1, 1)$. The black
and solid curves correspond to the solution to the ODE. In (c), for the x-axis we use the
identification between time and iterations, $t = k\sqrt{s}$.

Figure 2 illustrates the growth of $t^2(f(X(t)) - f^*)$ and $sk^2(f(x_k) - f^*)$, respectively,
for the generalized ODE and scheme with $r = 1$, where the objective function is simply
$f(x) = \frac{1}{2}x^2$. Inverse quadratic convergence fails to be observed in both Figures 2a and 2b,
where the scaled errors grow with $t$ or iterations, for both the generalized ODE and scheme.
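The behavior reported for Figure 2b can be reproduced with a short Python sketch of the generalized scheme on $f(x) = x^2/2$. The step size $s = 10^{-4}$ and starting point $x_0 = 1$ follow the figure caption; the iteration count, the comparison run with $r = 3$ and the function name are assumptions made for illustration.

```python
def scaled_errors(r, s=1e-4, num_iters=5000, x0=1.0):
    """Run the generalized scheme (momentum (k-1)/(k+r-1)) on f(x) = x^2 / 2
    and return the scaled errors s * k^2 * (f(x_k) - f*) for k = 1..num_iters."""
    x_prev, y = x0, x0
    errors = []
    for k in range(1, num_iters + 1):
        x = y - s * y                                  # gradient of x^2/2 is x
        y = x + (k - 1) / (k + r - 1) * (x - x_prev)   # generalized momentum
        x_prev = x
        errors.append(s * k * k * 0.5 * x * x)         # f* = 0
    return errors


errors_r1 = scaled_errors(r=1)
errors_r3 = scaled_errors(r=3)
max_r1, max_r3 = max(errors_r1), max(errors_r3)
print("largest scaled error, r = 1:", max_r1)
print("largest scaled error, r = 3:", max_r3)
```

With $r = 3$ the scaled error should stay bounded, whereas with $r = 1$ its peaks keep growing with the iteration count, in line with the phase transition described above.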



(a) Scaled errors $t^2(f(X(t)) - f^*)$. (b) Scaled errors $sk^2(f(x_k) - f^*)$.

Figure 2: Minimizing $f = \frac{1}{2}x^2$ by the generalized ODE and scheme with $r = 1$, starting
from $x_0 = 1$. In (b), the step size $s = 10^{-4}$.


1.3 Outline and Notation 

The rest of the paper is organized as follows. In Section 2, the ODE is rigorously derived
from Nesterov’s scheme, and a generalization to composite optimization, where $f$ may be
non-smooth, is also obtained. Connections between the ODE and the scheme, in terms
of trajectory behaviors and convergence rates, are summarized in Section 3. In Section
4, we discuss the effect of replacing the constant 3 in (3) by an arbitrary constant on the
convergence rate. A new restarting scheme is suggested in Section 5, with linear convergence
rate established and empirically observed.

Some standard notations used throughout the paper are collected here. We denote by
$\mathcal{F}_L$ the class of convex functions $f$ with $L$-Lipschitz continuous gradients defined on $\mathbb{R}^n$,
i.e., $f$ is convex, continuously differentiable, and satisfies

$$\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|$$

for any $x, y \in \mathbb{R}^n$, where $\|\cdot\|$ is the standard Euclidean norm and $L > 0$ is the Lipschitz
constant. Next, $\mathcal{S}_\mu$ denotes the class of $\mu$-strongly convex functions $f$ on $\mathbb{R}^n$ with continuous
gradients, i.e., $f$ is continuously differentiable and $f(x) - \mu\|x\|^2/2$ is convex. We set
$\mathcal{S}_{\mu,L} = \mathcal{F}_L \cap \mathcal{S}_\mu$.


2. Derivation 


First, we sketch an informal derivation of the ODE (3). Assume $f \in \mathcal{F}_L$ for $L > 0$.
Combining the two equations of (1) and applying a rescaling gives

$$\frac{x_{k+1} - x_k}{\sqrt{s}} = \frac{k-1}{k+2}\,\frac{x_k - x_{k-1}}{\sqrt{s}} - \sqrt{s}\,\nabla f(y_k). \tag{4}$$


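Introduce the Ansatz $x_k \approx X(k\sqrt{s})$ for some smooth curve $X(t)$ defined for $t \ge 0$. Put
$k = t/\sqrt{s}$. Then as the step size $s$ goes to zero, $X(t) \approx x_{t/\sqrt{s}} = x_k$ and
$X(t + \sqrt{s}) \approx x_{(t+\sqrt{s})/\sqrt{s}} = x_{k+1}$, and Taylor expansion gives

$$\frac{x_{k+1} - x_k}{\sqrt{s}} = \dot{X}(t) + \frac{1}{2}\ddot{X}(t)\sqrt{s} + o(\sqrt{s}), \qquad \frac{x_k - x_{k-1}}{\sqrt{s}} = \dot{X}(t) - \frac{1}{2}\ddot{X}(t)\sqrt{s} + o(\sqrt{s})$$

and $\sqrt{s}\,\nabla f(y_k) = \sqrt{s}\,\nabla f(X(t)) + o(\sqrt{s})$. Thus (4) can be written as

$$\dot{X}(t) + \frac{1}{2}\ddot{X}(t)\sqrt{s} + o(\sqrt{s}) = \left(1 - \frac{3\sqrt{s}}{t}\right)\left(\dot{X}(t) - \frac{1}{2}\ddot{X}(t)\sqrt{s} + o(\sqrt{s})\right) - \sqrt{s}\,\nabla f(X(t)) + o(\sqrt{s}). \tag{5}$$

By comparing the coefficients of $\sqrt{s}$ in (5), we obtain

$$\ddot{X} + \frac{3}{t}\dot{X} + \nabla f(X) = 0.$$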

The first initial condition is $X(0) = x_0$. Taking $k = 1$ in (4) yields

$$(x_2 - x_1)/\sqrt{s} = -\sqrt{s}\,\nabla f(y_1) = o(1).$$

Hence, the second initial condition is simply $\dot{X}(0) = 0$ (vanishing initial velocity).

One popular alternative momentum coefficient is $\theta_k(\theta_{k-1}^{-1} - 1)$, where $\theta_k$ are iteratively
defined as $\theta_{k+1} = \left(\sqrt{\theta_k^4 + 4\theta_k^2} - \theta_k^2\right)/2$, starting from $\theta_0 = 1$ (Nesterov, 1983; Beck and
Teboulle, 2009). Simple analysis reveals that $\theta_k(\theta_{k-1}^{-1} - 1)$ asymptotically equals $1 - 3/k +
O(1/k^2)$, thus leading to the same ODE as (1).
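The asymptotic claim about this alternative coefficient can be checked numerically. The Python sketch below iterates the recursion for $\theta_k$ and prints $k\,(1 - \theta_k(\theta_{k-1}^{-1} - 1))$, which should approach 3 if the coefficient behaves like $1 - 3/k$. The function name and the chosen values of $k$ are illustrative assumptions.

```python
import math


def momentum_coefficients(num_iters):
    """Return theta_k * (1/theta_{k-1} - 1) for k = 1..num_iters, theta_0 = 1."""
    theta_prev = 1.0
    coeffs = []
    for _ in range(num_iters):
        # theta_{k+1} = (sqrt(theta_k^4 + 4 theta_k^2) - theta_k^2) / 2
        theta = (math.sqrt(theta_prev ** 4 + 4 * theta_prev ** 2) - theta_prev ** 2) / 2
        coeffs.append(theta * (1 / theta_prev - 1))
        theta_prev = theta
    return coeffs


coeffs = momentum_coefficients(100000)
for k in (10, 100, 1000, 100000):
    # If the coefficient is 1 - 3/k + lower-order terms, this tends to 3.
    print(k, k * (1 - coeffs[k - 1]))
```

The convergence to 3 is slow for small $k$, so only the large-$k$ values are informative.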

Classical results in ODE theory do not directly imply the existence or uniqueness of the
solution to this ODE because the coefficient $3/t$ is singular at $t = 0$. In addition, $\nabla f$ is
typically not analytic at $x_0$, which leads to the inapplicability of the power series method for
studying singular ODEs. Nevertheless, the ODE is well posed: the strategy we employ for
showing this constructs a series of ODEs approximating (3), and then chooses a convergent
subsequence by some compactness arguments such as the Arzelà-Ascoli theorem. Below,
$C^2((0, \infty); \mathbb{R}^n)$ denotes the class of twice continuously differentiable maps from $(0, \infty)$ to $\mathbb{R}^n$;
similarly, $C^1([0, \infty); \mathbb{R}^n)$ denotes the class of continuously differentiable maps from $[0, \infty)$
to $\mathbb{R}^n$.

Theorem 1 For any $f \in \mathcal{F}_\infty := \cup_{L>0}\mathcal{F}_L$ and any $x_0 \in \mathbb{R}^n$, the ODE (3) with initial
conditions $X(0) = x_0$, $\dot{X}(0) = 0$ has a unique global solution $X \in C^2((0, \infty); \mathbb{R}^n) \cap C^1([0, \infty); \mathbb{R}^n)$.

The next theorem, in a rigorous way, guarantees the validity of the derivation of this ODE.
The proofs of both theorems are deferred to the appendices.

Theorem 2 For any $f \in \mathcal{F}_\infty$, as the step size $s \to 0$, Nesterov’s scheme (1) converges to
the ODE (3) in the sense that for all fixed $T > 0$,

$$\lim_{s \to 0} \max_{0 \le k \le T/\sqrt{s}} \|x_k - X(k\sqrt{s})\| = 0.$$


2.1 Simple Properties 

We collect some elementary properties that are helpful in understanding the ODE. 

Time Invariance. If we adopt a linear time transformation, $\tilde{t} = ct$ for some $c > 0$, by the
chain rule it follows that

$$\frac{dX}{d\tilde{t}} = \frac{1}{c}\frac{dX}{dt}, \qquad \frac{d^2X}{d\tilde{t}^2} = \frac{1}{c^2}\frac{d^2X}{dt^2}.$$

This yields the ODE parameterized by $\tilde{t}$,

$$\frac{d^2X}{d\tilde{t}^2} + \frac{3}{\tilde{t}}\frac{dX}{d\tilde{t}} + \nabla f(X)/c^2 = 0.$$

Also note that minimizing $f/c^2$ is equivalent to minimizing $f$. Hence, the ODE is invariant
under the time change. In fact, it is easy to see that time invariance holds if and only if the
coefficient of $\dot{X}$ has the form $C/t$ for some constant $C$.

Rotational Invariance. Nesterov’s scheme and other gradient-based schemes are invariant
under rotations. As expected, the ODE is also invariant under orthogonal transformation.
To see this, let $Y = QX$ for some orthogonal matrix $Q$. This leads to
$\dot{Y} = Q\dot{X}$, $\ddot{Y} = Q\ddot{X}$ and $\nabla_Y f = Q\nabla_X f$. Hence, denoting by $Q^T$ the transpose of $Q$,
the ODE in the new coordinate system reads $Q^T\ddot{Y} + \frac{3}{t}Q^T\dot{Y} + Q^T\nabla_Y f = 0$, which is of the
same form as (3) once multiplying $Q$ on both sides.

Initial Asymptotic. Assume sufficient smoothness of $X$ such that $\lim_{t \to 0}\ddot{X}(t)$ exists.
The mean value theorem guarantees the existence of some $\xi \in (0, t)$ that satisfies $\dot{X}(t)/t =
(\dot{X}(t) - \dot{X}(0))/t = \ddot{X}(\xi)$. Hence, from the ODE we deduce $\ddot{X}(t) + 3\ddot{X}(\xi) + \nabla f(X(t)) = 0$.
Taking the limit $t \to 0$ gives $\ddot{X}(0) = -\nabla f(x_0)/4$. Hence, for small $t$ we have the asymptotic
form:

$$X(t) = -\frac{\nabla f(x_0)t^2}{8} + x_0 + o(t^2).$$

This asymptotic expansion is consistent with the empirical observation that Nesterov’s
scheme moves slowly in the beginning.

2.2 ODE for Composite Optimization 

It is interesting and important to generalize the ODE to minimizing $f$ in the composite
form $f(x) = g(x) + h(x)$, where the smooth part $g \in \mathcal{F}_L$ and the non-smooth part $h :
\mathbb{R}^n \to (-\infty, \infty]$ is a structured general convex function. Both Nesterov (2013) and Beck
and Teboulle (2009) obtain $O(1/k^2)$ convergence rate by employing the proximal structure
of $h$. In analogy to the smooth case, an ODE for composite $f$ is derived in the appendix.


3. Connections and Interpretations 

In this section, we explore the approximate equivalence between the ODE and Nesterov’s
scheme, and provide evidence that the ODE can serve as an amenable tool for interpreting
and analyzing Nesterov’s scheme. The first subsection exhibits inverse quadratic convergence
rate for the ODE solution, the next two address the oscillation phenomenon discussed
in Section 1.1, and the last subsection is devoted to comparing Nesterov’s scheme with
gradient descent from a numerical perspective.


3.1 Analogous Convergence Rate 


The original result from Nesterov (1983) states that, for any $f \in \mathcal{F}_L$, the sequence $\{x_k\}$
given by (1) with step size $s \le 1/L$ satisfies

$$f(x_k) - f^* \le \frac{2\|x_0 - x^*\|^2}{s(k+1)^2}. \tag{6}$$

Our next result indicates that the trajectory of (3) closely resembles the sequence $\{x_k\}$ in
terms of the convergence rate to a minimizer $x^*$. Compared with the discrete case, this
proof is shorter and simpler.


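Theorem 3 For any $f \in \mathcal{F}_\infty$, let $X(t)$ be the unique global solution to (3) with initial
conditions $X(0) = x_0$, $\dot{X}(0) = 0$. Then, for any $t > 0$,

$$f(X(t)) - f^* \le \frac{2\|x_0 - x^*\|^2}{t^2}. \tag{7}$$

Proof Consider the energy functional$^1$ defined as $\mathcal{E}(t) = t^2(f(X(t)) - f^*) + 2\|X + t\dot{X}/2 -
x^*\|^2$, whose time derivative is

$$\dot{\mathcal{E}} = 2t(f(X) - f^*) + t^2\langle \nabla f, \dot{X} \rangle + 4\left\langle X + \frac{t}{2}\dot{X} - x^*, \frac{3}{2}\dot{X} + \frac{t}{2}\ddot{X} \right\rangle.$$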


1. We may also view this functional as the negative entropy. Similarly, for the gradient flow $\dot{X} + \nabla f(X) = 0$,
an energy function of form $\mathcal{E}_{\text{gradient}}(t) = t(f(X(t)) - f^*) + \|X(t) - x^*\|^2/2$ can be used to derive the
bound $f(X(t)) - f^* \le \frac{\|x_0 - x^*\|^2}{2t}$.









Su, Boyd and Candes 


Substituting $3\dot{X}/2 + t\ddot{X}/2$ with $-t\nabla f(X)/2$, the above equation gives

$$\dot{\mathcal{E}} = 2t(f(X) - f^*) + 4\langle X - x^*, -t\nabla f(X)/2 \rangle = 2t(f(X) - f^*) - 2t\langle X - x^*, \nabla f(X) \rangle \le 0,$$

where the inequality follows from the convexity of $f$. Hence by monotonicity of $\mathcal{E}$ and
non-negativity of $2\|X + t\dot{X}/2 - x^*\|^2$, the gap satisfies

$$f(X(t)) - f^* \le \frac{\mathcal{E}(t)}{t^2} \le \frac{\mathcal{E}(0)}{t^2} = \frac{2\|x_0 - x^*\|^2}{t^2}.$$

Making use of the approximation $t \approx k\sqrt{s}$, we observe that the convergence rate in (6) is
essentially a discrete version of that in (7), providing yet another piece of evidence for the
approximate equivalence between the ODE and the scheme.

We finish this subsection by showing that the number 2 appearing in the numerator of
the error bound in (7) is optimal. Consider an arbitrary $f \in \mathcal{F}_\infty(\mathbb{R})$ such that $f(x) = x$ for
$x \ge 0$. Starting from some $x_0 > 0$, the solution to (3) is $X(t) = x_0 - t^2/8$ before hitting the
origin. Hence, $t^2(f(X(t)) - f^*) = t^2(x_0 - t^2/8)$ has a maximum $2x_0^2 = 2|x_0 - 0|^2$ achieved
at $t = 2\sqrt{x_0}$. Therefore, we cannot replace 2 by any smaller number, and we can expect
that this tightness also applies to the discrete analog (6).
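The tightness example can be verified directly. The Python sketch below checks that $X(t) = x_0 - t^2/8$ has zero residual in the ODE (3) when $f(x) = x$, and locates the maximum of $t^2(x_0 - t^2/8)$ on a fine grid. The value $x_0 = 4$, the grid resolution and the helper names are assumptions chosen for illustration.

```python
import math

x0 = 4.0
t_hit = math.sqrt(8 * x0)      # time at which X(t) = x0 - t^2/8 reaches the origin


def ode_residual(t):
    """Residual of X'' + (3/t) X' + f'(X) for X(t) = x0 - t^2/8 and f(x) = x."""
    x_dot, x_ddot, grad_f = -t / 4, -0.25, 1.0
    return x_ddot + 3 / t * x_dot + grad_f


def scaled_gap(t):
    """t^2 (f(X(t)) - f*) with f* = 0, valid before the origin is hit."""
    return t * t * (x0 - t * t / 8)


grid = [i * t_hit / 200000 for i in range(200001)]
best_t = max(grid, key=scaled_gap)
best_val = scaled_gap(best_t)
print("maximizer:", best_t, "predicted 2*sqrt(x0):", 2 * math.sqrt(x0))
print("maximum:", best_val, "predicted 2*x0^2:", 2 * x0 ** 2)
```

The grid maximum should agree with $2x_0^2$ up to the grid resolution, which matches the constant 2 in (7).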


3.2 Quadratic f and Bessel Functions

For quadratic /, the ODE (3) admits a solution in closed form. This closed form solution 
turns out to be very useful in understanding the issues raised in the introduction. 

Let $f(x) = \frac{1}{2}\langle x, Ax \rangle + \langle b, x \rangle$, where $A \in \mathbb{R}^{n \times n}$ is a positive semidefinite matrix and $b$ is
in the column space of $A$ because otherwise this function can attain $-\infty$. Then a simple
translation in $x$ can absorb the linear term $\langle b, x \rangle$ into the quadratic term. Since both the
ODE and the scheme move within the affine space perpendicular to the kernel of $A$, without
loss of generality, we assume that $A$ is positive definite, admitting a spectral decomposition
$A = Q^T \Lambda Q$, where $\Lambda$ is a diagonal matrix formed by the eigenvalues. Replacing $x$ with $Qx$,
we assume $f = \frac{1}{2}\langle x, \Lambda x \rangle$ from now on. Now, the ODE for this function admits a simple
decomposition of form

$$\ddot{X}_i + \frac{3}{t}\dot{X}_i + \lambda_i X_i = 0, \quad i = 1, \ldots, n$$

with $X_i(0) = x_{0,i}$, $\dot{X}_i(0) = 0$. Introduce $Y_i(u) = uX_i(u/\sqrt{\lambda_i})$, which satisfies

$$u^2\ddot{Y}_i + u\dot{Y}_i + (u^2 - 1)Y_i = 0.$$


This is Bessel’s differential equation of order one. Since $Y_i$ vanishes at $u = 0$, we see that
$Y_i$ is a constant multiple of $J_1$, the Bessel function of the first kind of order one.$^2$ It has an
analytic expansion:

$$J_1(u) = \sum_{m=0}^{\infty} \frac{(-1)^m}{(2m)!!\,(2m+2)!!}\, u^{2m+1},$$


2. Up to a constant multiplier, $J_1$ is the unique solution to the Bessel’s differential equation $u^2\ddot{J}_1 + u\dot{J}_1 +
(u^2 - 1)J_1 = 0$ that is finite at the origin. In the analytic expansion of $J_1$, $m!!$ denotes the double factorial
defined as $m!! = m \times (m-2) \times \cdots \times 2$ for even $m$, or $m!! = m \times (m-2) \times \cdots \times 1$ for odd $m$.

which gives the asymptotic expansion

$$J_1(u) = (1 + o(1))\frac{u}{2}$$

when $u \to 0$. Requiring $X_i(0) = x_{0,i}$, hence, we obtain

$$X_i(t) = \frac{2x_{0,i}}{t\sqrt{\lambda_i}} J_1(t\sqrt{\lambda_i}). \tag{8}$$
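Formula (8) makes it possible to compare the ODE solution with Nesterov’s iterates for a one-dimensional quadratic. The Python sketch below evaluates $J_1$ through the series given above and measures the largest distance between $x_k$ and $X(k\sqrt{s})$ over $0 \le k \le T/\sqrt{s}$, the quantity appearing in Theorem 2. The choices $\lambda = 1$, $x_0 = 1$, $T = 5$, the step sizes, the series truncation at 60 terms and all function names are illustrative assumptions.

```python
import math


def bessel_j1(u, terms=60):
    """J_1(u) via the series sum_m (-1)^m u^(2m+1) / ((2m)!! (2m+2)!!)."""
    total, term = 0.0, u / 2          # m = 0 term: u / (0!! * 2!!) = u / 2
    for m in range(terms):
        total += term
        # ratio of consecutive terms is -u^2 / ((2m+2)(2m+4))
        term *= -u * u / ((2 * m + 2) * (2 * m + 4))
    return total


def ode_solution(t, x0, lam):
    """Formula (8) for f(x) = lam * x^2 / 2; the limit as t -> 0 is x0."""
    if t == 0:
        return x0
    u = t * math.sqrt(lam)
    return 2 * x0 / u * bessel_j1(u)


def nesterov_path(x0, lam, s, num_iters):
    """Iterates x_0, ..., x_num_iters of scheme (1) for f(x) = lam * x^2 / 2."""
    xs, x_prev, y = [x0], x0, x0
    for k in range(1, num_iters + 1):
        x = y - s * lam * y
        y = x + (k - 1) / (k + 2) * (x - x_prev)
        x_prev = x
        xs.append(x)
    return xs


def max_gap(s, x0=1.0, lam=1.0, T=5.0):
    """max over 0 <= k <= T/sqrt(s) of |x_k - X(k sqrt(s))|."""
    n = round(T / math.sqrt(s))
    xs = nesterov_path(x0, lam, s, n)
    return max(abs(xk - ode_solution(k * math.sqrt(s), x0, lam))
               for k, xk in enumerate(xs))


for s in (1e-2, 1e-4, 1e-6):
    print(s, max_gap(s))
```

The printed gap should shrink as $s$ decreases, consistent with the convergence stated in Theorem 2.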

For large $t$, the Bessel function has the following asymptotic form (see e.g. Watson, 1995):

$$J_1(t) = \sqrt{\frac{2}{\pi t}}\left(\cos(t - 3\pi/4) + O(1/t)\right). \tag{9}$$


This asymptotic expansion yields (note that $f^* = 0$)

$$f(X(t)) - f^* = f(X(t)) = \sum_{i=1}^{n} \frac{2x_{0,i}^2}{t^2} J_1\left(t\sqrt{\lambda_i}\right)^2 = O\left(\frac{\|x_0 - x^*\|^2}{t^3\sqrt{\min_i \lambda_i}}\right). \tag{10}$$


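On the other hand, (9) and (10) give a lower bound:

$$\begin{aligned}
\limsup_{t \to \infty} t^3(f(X(t)) - f^*) &\ge \lim_{t \to \infty} \frac{1}{t}\int_0^t u^3(f(X(u)) - f^*)\,du \\
&= \lim_{t \to \infty} \frac{1}{t}\int_0^t \sum_{i=1}^{n} 2x_{0,i}^2\, u J_1(u\sqrt{\lambda_i})^2\,du \\
&= \sum_{i=1}^{n} \frac{2x_{0,i}^2}{\pi\sqrt{\lambda_i}} \ge \frac{2\|x_0 - x^*\|^2}{\pi\sqrt{L}},
\end{aligned} \tag{11}$$

where $L = \|A\|_2$ is the spectral norm of $A$. The first inequality follows by interpreting
$\lim_{t \to \infty} \frac{1}{t}\int_0^t u^3(f(X(u)) - f^*)\,du$ as the mean of $u^3(f(X(u)) - f^*)$ on $(0, \infty)$ in a certain
sense.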

In view of (10), Nesterov’s scheme might possibly exhibit $O(1/k^3)$ convergence rate for
strongly convex functions. This convergence rate is consistent with the second inequality
in Theorem 6. In Section 4.3, we prove the $O(1/t^3)$ rate for a generalized version of (3).
However, (11) rules out the possibility of a higher order convergence rate.

Recall that the function considered in Figure 1 is $f(x) = 0.02x_1^2 + 0.005x_2^2$, starting
from $x_0 = (1, 1)$. As the step size $s$ becomes smaller, the trajectory of Nesterov’s scheme
converges to the solid curve represented via the Bessel function. While approaching the
minimizer $x^*$, each trajectory displays the oscillation pattern, as well-captured by the zoomed
Figure 1b. This prevents Nesterov’s scheme from achieving a better convergence rate. The
representation (8) offers an excellent explanation as follows. Denote by $T_1, T_2$, respectively,
the approximate periodicities of the first component $|X_1|$ in absolute value and the second
$|X_2|$. By (9), we get $T_1 = \pi/\sqrt{\lambda_1} = 5\pi$ and $T_2 = \pi/\sqrt{\lambda_2} = 10\pi$. Hence, as the amplitude
gradually decreases to zero, the function $f = 2x_{0,1}^2 J_1(\sqrt{\lambda_1}t)^2/t^2 + 2x_{0,2}^2 J_1(\sqrt{\lambda_2}t)^2/t^2$ has a
major cycle of $10\pi$, the least common multiple of $T_1$ and $T_2$. A careful look at Figure 1c
reveals that within each major bump, roughly, there are $10\pi/T_1 = 2$ minor peaks.

3.3 Fluctuations of Strongly Convex f

The analysis carried out in the previous subsection only applies to convex quadratic functions.
In this subsection, we extend the discussion to one-dimensional strongly convex
functions. The Sturm-Picone theory (see e.g. Hinton, 2005) is extensively used all along the
analysis.

Let $f \in \mathcal{S}_{\mu,L}(\mathbb{R})$. Without loss of generality, assume $f$ attains minimum at $x^* = 0$.
Then, by definition $\mu \le f'(x)/x \le L$ for any $x \ne 0$. Denoting by $X$ the solution to the
ODE (3), we consider the self-adjoint equation,

$$(t^3 Y')' + \frac{t^3 f'(X(t))}{X(t)}\,Y = 0, \tag{12}$$

which, apparently, admits a solution $Y(t) = X(t)$. To apply the Sturm-Picone comparison
theorem, consider

$$(t^3 Y')' + \mu t^3 Y = 0$$

for a comparison. This equation admits a solution $Y(t) = J_1(\sqrt{\mu}\,t)/t$. Denote by $\tilde{t}_1 < \tilde{t}_2 <
\cdots$ all the positive roots of $J_1(t)$, which satisfy (see e.g. Watson, 1995)

$$3.8317 = \tilde{t}_1 - \tilde{t}_0 > \tilde{t}_2 - \tilde{t}_1 > \tilde{t}_3 - \tilde{t}_2 > \cdots > \pi,$$

where $\tilde{t}_0 = 0$. Then, it follows that the positive roots of $Y$ are $\tilde{t}_1/\sqrt{\mu}, \tilde{t}_2/\sqrt{\mu}, \ldots$. Since
$t^3 f'(X(t))/X(t) \ge \mu t^3$, the Sturm-Picone comparison theorem asserts that $X(t)$ has a root
in each interval $[\tilde{t}_i/\sqrt{\mu}, \tilde{t}_{i+1}/\sqrt{\mu}]$.
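The spacing of the positive roots of $J_1$ quoted above can be reproduced numerically. The Python sketch below evaluates $J_1$ with the power series from Section 3.2 and locates the first three positive roots by bisection. The bracketing intervals, the series truncation and the function names are assumptions made for this illustration.

```python
import math


def bessel_j1(u, terms=80):
    """J_1(u) via the power series sum_m (-1)^m u^(2m+1) / ((2m)!! (2m+2)!!)."""
    total, term = 0.0, u / 2
    for m in range(terms):
        total += term
        term *= -u * u / ((2 * m + 2) * (2 * m + 4))
    return total


def bisect_root(lo, hi, tol=1e-10):
    """Plain bisection; assumes J_1 changes sign exactly once on [lo, hi]."""
    sign_lo = bessel_j1(lo) > 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (bessel_j1(mid) > 0) == sign_lo:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Assumed brackets around the first three positive roots of J_1.
brackets = [(3.0, 4.5), (6.5, 7.5), (10.0, 10.5)]
roots = [bisect_root(lo, hi) for lo, hi in brackets]
gaps = [b - a for a, b in zip([0.0] + roots[:-1], roots)]
print("roots:", roots)
print("consecutive gaps:", gaps, "pi:", math.pi)
```

Dividing the computed roots by $\sqrt{\mu}$ gives the roots of the comparison solution $J_1(\sqrt{\mu}\,t)/t$ for any chosen $\mu$.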

To obtain a similar result in the opposite direction, consider

$$(t^3 Y')' + Lt^3 Y = 0. \tag{13}$$

Applying the Sturm-Picone comparison theorem to (12) and (13), we ensure that between
any two consecutive positive roots of $X$, there is at least one $\tilde{t}_i/\sqrt{L}$. Now, we summarize
our findings in the following. Roughly speaking, this result concludes that the oscillation
frequency of the ODE solution is between $O(\sqrt{\mu})$ and $O(\sqrt{L})$.

Theorem 4 Denote by $0 < t_1 < t_2 < \cdots$ all the roots of $X(t) - x^*$. Then these roots
satisfy, for all $i \ge 1$,

$$t_1 < \frac{7.6635}{\sqrt{\mu}}, \qquad t_{i+1} - t_i < \frac{7.6635}{\sqrt{\mu}}, \qquad t_{i+2} - t_i > \frac{\pi}{\sqrt{L}}.$$

3.4 Nesterov’s Scheme Compared with Gradient Descent 

The ansatz $t \approx k\sqrt{s}$ in relating the ODE and Nesterov’s scheme is formally confirmed in
Theorem 2. Consequently, for any constant $t_c > 0$, this implies that $x_k$ does not change
much for a range of step sizes $s$ if $k \approx t_c/\sqrt{s}$. To empirically support this claim, we present
an example in Figure 3a, where the scheme minimizes $f(x) = \|y - Ax\|^2/2 + \|x\|_1$ with
$y = (4, 2, 0)$ and $A(:,1) = (0, 2, 4)$, $A(:,2) = (1, 1, 1)$ starting from $x_0 = (2, 0)$ (here
$A(:,j)$ is the $j$th column of $A$). From this figure, we are delighted to observe that $x_k$ with the
same $t_c$ are very close to each other.

This interesting square-root scaling has the potential to shed light on the superiority
of Nesterov’s scheme over gradient descent. Roughly speaking, each iteration in Nesterov’s
scheme amounts to traveling $\sqrt{s}$ in time along the integral curve of (3), whereas it is known
that the simple gradient descent $x_{k+1} = x_k - s\nabla f(x_k)$ moves $s$ along the integral curve
of $\dot{X} + \nabla f(X) = 0$. We expect that for small $s$ Nesterov’s scheme moves more in each
iteration since $\sqrt{s}$ is much larger than $s$. Figure 3b illustrates and supports this claim,
where the function minimized is $f = |x_1|^3 + 5|x_2|^3 + 0.001(x_1 + x_2)^2$ with step size $s = 0.05$
(the coordinates are appropriately rotated to allow $x_0$ and $x^*$ to lie on the same horizontal
line). The circles are the iterates for $k = 1, 10, 20, 30, 45, 60, 90, 120, 150, 190, 250, 300$. For
Nesterov’s scheme, the seventh circle has already passed $t = 15$, while for gradient descent
the last point has merely arrived at $t = 15$.

(a) Square-root scaling of $s$. (b) Race between Nesterov’s and gradient.

Figure 3: In (a), the circles, crosses and triangles are $x_k$ evaluated at $k = \lceil 1/\sqrt{s} \rceil$, $\lceil 2/\sqrt{s} \rceil$
and $\lceil 3/\sqrt{s} \rceil$, respectively. In (b), the circles are iterations given by Nesterov’s scheme or
gradient descent, depending on the color, and the stars are $X(t)$ on the integral curves for
$t = 5, 15$.


A second look at Figure 3b suggests that Nesterov’s scheme allows a large deviation
from its limit curve, as compared with gradient descent. This raises the question of the
stable step size allowed for numerically solving the ODE (3) in the presence of accumulated
errors. The finite difference approximation by the forward Euler method is

$$\frac{X(t + \Delta t) - 2X(t) + X(t - \Delta t)}{\Delta t^2} + \frac{3}{t}\,\frac{X(t) - X(t - \Delta t)}{\Delta t} + \nabla f(X(t)) = 0, \tag{14}$$

which is equivalent to

$$X(t + \Delta t) = \left(2 - \frac{3\Delta t}{t}\right)X(t) - \Delta t^2\,\nabla f(X(t)) - \left(1 - \frac{3\Delta t}{t}\right)X(t - \Delta t). \tag{15}$$

Assuming $f$ is sufficiently smooth, we have $\nabla f(x + \delta x) \approx \nabla f(x) + \nabla^2 f(x)\delta x$ for small
perturbations $\delta x$, where $\nabla^2 f(x)$ is the Hessian of $f$ evaluated at $x$. Identifying $k = t/\Delta t$,
the characteristic equation of this finite difference scheme is approximately

$$\det\left(\lambda^2 - \left(2 - \Delta t^2\,\nabla^2 f - \frac{3\Delta t}{t}\right)\lambda + 1 - \frac{3\Delta t}{t}\right) = 0. \tag{16}$$

The numerical stability of (14) with respect to accumulated errors is equivalent to this: all
the roots of (16) lie in the unit circle (see e.g. Leader, 2004). When $\nabla^2 f \preceq LI_n$ (i.e. $LI_n -
\nabla^2 f$ is positive semidefinite), if $\Delta t/t$ is small and $\Delta t < 2/\sqrt{L}$, we see that all the roots of
(16) lie in the unit circle. On the other hand, if $\Delta t > 2/\sqrt{L}$, (16) can possibly have a root
$\lambda$ outside the unit circle, causing numerical instability. Under our identification $s = \Delta t^2$, a
step size of $s = 1/L$ in Nesterov’s scheme (1) is approximately equivalent to a step size of
$\Delta t = 1/\sqrt{L}$ in the forward Euler method, which is stable for numerically integrating (14).
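The stability threshold $2/\sqrt{L}$ can be observed by iterating recursion (15) on a one-dimensional quadratic. In the Python sketch below, the objective $f(x) = Lx^2/2$ with $L = 100$, the two step sizes, the step counts and the way the recursion is started (two equal initial values, modeling zero initial velocity) are assumptions made for illustration.

```python
def euler_path(dt, L, num_steps, x0=1.0):
    """Iterate recursion (15) for f(x) = L * x^2 / 2, starting from rest."""
    x_prev, x = x0, x0            # stand-ins for X(0) and X(dt): zero velocity
    for k in range(1, num_steps + 1):
        t = k * dt
        damp = 3 * dt / t         # the factor 3 * dt / t in (15), equal to 3 / k
        x_next = (2 - damp) * x - dt ** 2 * L * x - (1 - damp) * x_prev
        x_prev, x = x, x_next
    return x


L = 100.0                          # stability threshold 2 / sqrt(L) = 0.2
stable = euler_path(dt=0.1, L=L, num_steps=2000)     # dt = 1/sqrt(L)
unstable = euler_path(dt=0.25, L=L, num_steps=100)   # dt > 2/sqrt(L)
print("dt = 0.10:", stable)
print("dt = 0.25:", unstable)
```

With the smaller step the iterate should decay toward the minimizer, while the larger step should produce rapidly growing oscillations, matching the root condition on (16).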

As a comparison, note that the finite difference scheme of the ODE $\dot{X}(t) + \nabla f(X(t)) = 0$,
which models gradient descent with updates $x_{k+1} = x_k - s\nabla f(x_k)$, has the characteristic
equation $\det(\lambda - (1 - \Delta t\,\nabla^2 f)) = 0$. Thus, to guarantee $-I_n \preceq 1 - \Delta t\,\nabla^2 f \preceq I_n$ in worst
case analysis, one can only choose $\Delta t < 2/L$ for a fixed step size, which is much smaller
than the step size $2/\sqrt{L}$ for (14) when $\nabla f$ is very variable, i.e., $L$ is large.


4. The Magic Constant 3 

Recall that the constant 3 appearing in the coefficient of $\dot{X}$ in (3) originates from $(k +
2) - (k - 1) = 3$. This number leads to the momentum coefficient in (1) taking the form
$(k-1)/(k+2) = 1 - 3/k + O(1/k^2)$. In this section, we demonstrate that 3 can be replaced
by any larger number, while maintaining the $O(1/k^2)$ convergence rate. To begin with, let
us consider the following ODE parameterized by a constant $r$:

$$\ddot{X} + \frac{r}{t}\dot{X} + \nabla f(X) = 0 \tag{17}$$

with initial conditions $X(0) = x_0$, $\dot{X}(0) = 0$. The proof of Theorem 1, which seamlessly
applies here, guarantees the existence and uniqueness of the solution $X$ to this ODE.

Interpreting the damping ratio $r/t$ as a measure of friction$^3$ in the damping system,
our results say that more friction does not end the $O(1/t^2)$ and $O(1/k^2)$ convergence rate.
On the other hand, in the lower friction setting, where $r$ is smaller than 3, we can no
longer expect inverse quadratic convergence rate, unless some additional structures of $f$ are
imposed. We believe that this striking phase transition at 3 deserves more attention as an
interesting research challenge.


4.1 High Friction 

Here, we study the convergence rate of (17) with $r > 3$ and $f \in \mathcal{F}_\infty$. Compared with (3),
this new ODE, viewed as a damping system, suffers from higher friction. Following the strategy adopted in
the proof of Theorem 3, we consider a new energy functional defined as

$$\mathcal{E}(t) = \frac{2t^2}{r-1}(f(X(t)) - f^*) + (r-1)\left\|X(t) + \frac{t}{r-1}\dot{X}(t) - x^*\right\|^2.$$

3. In physics and engineering, damping may be modeled as a force proportional to velocity but opposite in
direction, i.e. resisting motion; for instance, this force may be used as an approximation to the friction
caused by drag. In our model, this force would be proportional to $-\frac{3}{t}\dot{X}$ where $\dot{X}$ is velocity and $\frac{3}{t}$ is
the damping coefficient.

By studying the derivative of this functional, we get the following result.

Theorem 5 The solution $X$ to (17) satisfies

$$f(X(t)) - f^* \le \frac{(r-1)^2\|x_0 - x^*\|^2}{2t^2}, \qquad \int_0^\infty t(f(X(t)) - f^*)\,dt \le \frac{(r-1)^2\|x_0 - x^*\|^2}{2(r-3)}.$$

Proof Noting $r\dot{X} + t\ddot{X} = -t\nabla f(X)$, we get $\dot{\mathcal{E}}$ equal to

$$\begin{aligned}
&\frac{4t}{r-1}(f(X) - f^*) + \frac{2t^2}{r-1}\langle \nabla f, \dot{X} \rangle + 2\left\langle X + \frac{t}{r-1}\dot{X} - x^*, r\dot{X} + t\ddot{X} \right\rangle \\
&\quad = \frac{4t}{r-1}(f(X) - f^*) - 2t\langle X - x^*, \nabla f(X) \rangle \le -\frac{2(r-3)t}{r-1}(f(X) - f^*),
\end{aligned} \tag{18}$$

where the inequality follows from the convexity of $f$. Since $f(X) \ge f^*$, the last display
implies that $\mathcal{E}$ is non-increasing. Hence

$$\frac{2t^2}{r-1}(f(X(t)) - f^*) \le \mathcal{E}(t) \le \mathcal{E}(0) = (r-1)\|x_0 - x^*\|^2,$$

yielding the first inequality of this theorem. To complete the proof, from (18) it follows
that

$$\int_0^\infty \frac{2(r-3)t}{r-1}(f(X) - f^*)\,dt \le -\int_0^\infty \frac{d\mathcal{E}}{dt}\,dt = \mathcal{E}(0) - \mathcal{E}(\infty) \le (r-1)\|x_0 - x^*\|^2,$$

as desired for establishing the second inequality. ■

The first inequality is the same as (7) for the ODE (3), except for a larger constant $(r-1)^2/2$.
The second inequality measures the error $f(X(t)) - f^*$ in an average sense, and cannot be
deduced from the first inequality.

Now, it is tempting to obtain such analogs for the discrete Nesterov’s scheme as well.
Following the formulation of Beck and Teboulle (2009), we wish to minimize $f$ in the
composite form $f(x) = g(x) + h(x)$, where $g \in \mathcal{F}_L$ for some $L > 0$ and $h$ is convex on $\mathbb{R}^n$,
possibly assuming extended value $\infty$. Define the proximal subgradient

$$G_s(x) \triangleq \frac{x - \operatorname{argmin}_z\left(\|z - (x - s\nabla g(x))\|^2/(2s) + h(z)\right)}{s}.$$


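Parametrizing by a constant $r$, we propose the generalized Nesterov’s scheme,

$$x_k = y_{k-1} - sG_s(y_{k-1}), \qquad y_k = x_k + \frac{k-1}{k+r-1}(x_k - x_{k-1}), \tag{19}$$

starting from $y_0 = x_0$. The discrete analog of Theorem 5 is below.

Theorem 6 The sequence $\{x_k\}$ given by (19) with $0 < s \le 1/L$ satisfies

$$f(x_k) - f^* \le \frac{(r-1)^2\|x_0 - x^*\|^2}{2s(k+r-2)^2}, \qquad \sum_{k=1}^{\infty}(k+r-1)(f(x_k) - f^*) \le \frac{(r-1)^2\|x_0 - x^*\|^2}{2s(r-3)}.$$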
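As an illustration of scheme (19), the Python sketch below runs it on a one-dimensional composite problem, $f(x) = (x - b)^2/2 + \gamma|x|$, for which the step $y - sG_s(y)$ reduces to soft-thresholding. The problem data, the choices $r = 4$ and $s = 0.5$, the floating-point tolerance and all function names are assumptions made for this example. The first inequality of Theorem 6 is checked at every iterate.

```python
def soft_threshold(w, tau):
    """Proximal map of tau * |.|: shrink w toward zero by tau."""
    if w > tau:
        return w - tau
    if w < -tau:
        return w + tau
    return 0.0


def generalized_nesterov(b, gamma, x0, r, s, num_iters):
    """Scheme (19) for f(x) = (x - b)^2 / 2 + gamma * |x| in one dimension."""
    xs, x_prev, y = [x0], x0, x0
    for k in range(1, num_iters + 1):
        # x_k = y_{k-1} - s * G_s(y_{k-1}) is exactly the proximal gradient step
        x = soft_threshold(y - s * (y - b), s * gamma)
        y = x + (k - 1) / (k + r - 1) * (x - x_prev)
        x_prev = x
        xs.append(x)
    return xs


b, gamma, x0, r, s = 3.0, 1.0, -5.0, 4.0, 0.5     # here g has L = 1, so s <= 1/L


def f(x):
    return 0.5 * (x - b) ** 2 + gamma * abs(x)


x_star = soft_threshold(b, gamma)                  # closed-form minimizer
xs = generalized_nesterov(b, gamma, x0, r, s, 500)
bound_ok = all(
    f(xk) - f(x_star)
    <= (r - 1) ** 2 * (x0 - x_star) ** 2 / (2 * s * (k + r - 2) ** 2) + 1e-12
    for k, xk in enumerate(xs) if k >= 1
)
print("final iterate:", xs[-1], "minimizer:", x_star)
print("first inequality of Theorem 6 holds along the path:", bound_ok)
```

For other choices of $h$ the soft-thresholding step would be replaced by the corresponding proximal map.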




The first inequality suggests that the generalized Nesterov’s schemes still achieve $O(1/k^2)$
convergence rate. However, if the error bound satisfies $f(x_{k'}) - f^* \ge c/k'^2$ for some
arbitrarily small $c > 0$ and a dense subsequence $\{k'\}$, i.e., $|\{k'\} \cap \{1, \ldots, m\}| \ge \alpha m$ for all $m \ge 1$
and some $\alpha > 0$, then the second inequality of the theorem would be violated. To see this,
note that if it were the case, we would have $(k' + r - 1)(f(x_{k'}) - f^*) \gtrsim \frac{1}{k'}$; the sum of the
harmonic series $\frac{1}{k'}$ over a dense subset of $\{1, 2, \ldots\}$ is infinite. Hence, the second inequality
is not trivial because it implies the error bound is, in some sense, $O(1/k^2)$ suboptimal.

Now we turn to the proof of this theorem. It is worth pointing out that, though based 
on the same idea, the proof below is much more complicated than that of Theorem 5. 
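Proof Consider the discrete energy functional,

\[
\mathcal{E}(k) = \frac{2(k+r-2)^2 s}{r-1}\bigl(f(x_k) - f^\star\bigr) + (r-1)\|z_k - x^\star\|^2,
\]

where $z_k = (k+r-1)y_k/(r-1) - kx_k/(r-1)$. If we have

\[
\mathcal{E}(k) + \frac{2s[(r-3)(k+r-2)+1]}{r-1}\bigl(f(x_{k-1}) - f^\star\bigr) \le \mathcal{E}(k-1),
\tag{20}
\]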


then it would immediately yield the desired results by summing (20) over $k$. That is, by
recursively applying (20), we see

\[
\mathcal{E}(k) + \sum_{i=1}^{k} \frac{2s[(r-3)(i+r-2)+1]}{r-1}\bigl(f(x_{i-1}) - f^\star\bigr)
\le \mathcal{E}(0) = \frac{2s(r-2)^2}{r-1}\bigl(f(x_0) - f^\star\bigr) + (r-1)\|x_0 - x^\star\|^2,
\]

which is equivalent to

\[
\mathcal{E}(k) + \sum_{i=1}^{k-1} \frac{2s[(r-3)(i+r-1)+1]}{r-1}\bigl(f(x_i) - f^\star\bigr) \le (r-1)\|x_0 - x^\star\|^2.
\tag{21}
\]

Noting that the left-hand side of (21) is lower bounded by $2s(k+r-2)^2(f(x_k) - f^\star)/(r-1)$,
we thus obtain the first inequality of the theorem. Since $\mathcal{E}(k) \ge 0$, the second inequality
is verified via taking the limit $k \to \infty$ in (21) and replacing $(r-3)(i+r-1)+1$ by
$(r-3)(i+r-1)$.

We now establish (20). For $s \le 1/L$, we have the basic inequality,

\[
f(y - sG_s(y)) \le f(x) + G_s(y)^T(y - x) - \frac{s}{2}\|G_s(y)\|^2,
\tag{22}
\]

for any $x$ and $y$. Note that $y_{k-1} - sG_s(y_{k-1})$ actually coincides with $x_k$. Summing
$(k-1)/(k+r-2) \times$ (22) with $x = x_{k-1}, y = y_{k-1}$ and $(r-1)/(k+r-2) \times$ (22) with
$x = x^\star, y = y_{k-1}$ gives

\[
\begin{aligned}
f(x_k) &\le \frac{k-1}{k+r-2} f(x_{k-1}) + \frac{r-1}{k+r-2} f^\star \\
&\quad + \frac{r-1}{k+r-2}\, G_s(y_{k-1})^T\Bigl(\frac{k+r-2}{r-1}y_{k-1} - \frac{k-1}{r-1}x_{k-1} - x^\star\Bigr) - \frac{s}{2}\|G_s(y_{k-1})\|^2 \\
&= \frac{k-1}{k+r-2} f(x_{k-1}) + \frac{r-1}{k+r-2} f^\star + \frac{(r-1)^2}{2s(k+r-2)^2}\bigl(\|z_{k-1} - x^\star\|^2 - \|z_k - x^\star\|^2\bigr),
\end{aligned}
\]

where we use $z_{k-1} - s(k+r-2)G_s(y_{k-1})/(r-1) = z_k$. Rearranging the above inequality
and multiplying by $2s(k+r-2)^2/(r-1)$ gives the desired (20).
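For readers who want the final rearrangement spelled out, the following is one way to carry it out. This is our own expansion of the step described in the proof; the shorthand $m := k + r - 2$ is introduced here only for brevity.

```latex
\begin{align*}
% Multiply the last display by 2s(k+r-2)^2/(r-1) and move the z_k term to the left:
\underbrace{\frac{2s(k+r-2)^2}{r-1}\bigl(f(x_k)-f^\star\bigr) + (r-1)\|z_k-x^\star\|^2}_{=\,\mathcal{E}(k)}
  &\le \frac{2s(k+r-2)(k-1)}{r-1}\bigl(f(x_{k-1})-f^\star\bigr) + (r-1)\|z_{k-1}-x^\star\|^2 .\\
% Compare the coefficient on the right with the one appearing in E(k-1):
(k+r-3)^2 - (k+r-2)(k-1) &= (m-1)^2 - m(m-r+1) = (r-3)m + 1, \qquad m := k+r-2 .
\end{align*}
```

Hence the right-hand side equals $\mathcal{E}(k-1) - \frac{2s[(r-3)(k+r-2)+1]}{r-1}\bigl(f(x_{k-1})-f^\star\bigr)$, which is exactly (20).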


In closing, we would like to point out that this new scheme is equivalent to setting $\theta_k =
(r-1)/(k+r-1)$ and letting $\theta_k(\theta_{k-1}^{-1} - 1)$ replace the momentum coefficient $(k-1)/(k+r-1)$.
Then, the equal sign "$=$" in the update $\theta_{k+1} = \bigl(\sqrt{\theta_k^4 + 4\theta_k^2} - \theta_k^2\bigr)/2$ has to be replaced by
an inequality sign "$\ge$". In examining the proof of Theorem 1(b) in Tseng (2010), we can
get an alternative proof of Theorem 6.
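The two claims in this remark can be checked by direct substitution. The short computation below is our own verification (for $r \ge 3$), not part of the original text.

```latex
\begin{align*}
% The momentum coefficient is recovered exactly:
\theta_k\bigl(\theta_{k-1}^{-1} - 1\bigr)
  &= \frac{r-1}{k+r-1}\Bigl(\frac{k+r-2}{r-1} - 1\Bigr) = \frac{k-1}{k+r-1}. \\
% The update with ">=" is equivalent to (1-\theta_{k+1})/\theta_{k+1}^2 <= 1/\theta_k^2, and
\frac{1}{\theta_k^2} - \frac{1-\theta_{k+1}}{\theta_{k+1}^2}
  &= \frac{(k+r-1)^2 - (k+1)(k+r)}{(r-1)^2}
   = \frac{(r-3)k + r^2 - 3r + 1}{(r-1)^2} > 0 \quad \text{for } r \ge 3 .
\end{align*}
```

For $r = 3$ the gap reduces to $1/(r-1)^2 = 1/4$, so the relation holds with strict inequality rather than equality.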


4.2 Low Friction 

Now we turn to the case $r < 3$. Then, unfortunately, the energy functional approach for
proving Theorem 5 is no longer valid, since the left-hand side of (18) is positive in general.
In fact, there are counterexamples that fail the desired $O(1/t^2)$ or $O(1/k^2)$ convergence
rate. We present such examples in continuous time. Equally, these examples would also
violate the $O(1/k^2)$ convergence rate in the discrete schemes, and we forego the details.
Let $f(x) = \frac12\|x\|^2$ and $X$ be the solution to (17). Then, $Y = t^{\frac{r-1}{2}}X$ satisfies

\[
t^2\ddot Y + t\dot Y + \bigl(t^2 - (r-1)^2/4\bigr)Y = 0.
\]

With the initial condition $Y(t) \approx t^{\frac{r-1}{2}}x_0$ for small $t$, the solution to the above Bessel
equation in a vector form of order $(r-1)/2$ is $Y(t) = 2^{\frac{r-1}{2}}\Gamma((r+1)/2)J_{(r-1)/2}(t)\,x_0$. Thus,

\[
X(t) = \frac{2^{\frac{r-1}{2}}\Gamma((r+1)/2)J_{(r-1)/2}(t)}{t^{\frac{r-1}{2}}}\,x_0.
\]

For large $t$, the Bessel function $J_{(r-1)/2}(t) = \sqrt{2/(\pi t)}\bigl(\cos(t - (r-1)\pi/4 - \pi/4) + O(1/t)\bigr)$.
Hence

\[
f(X(t)) - f^\star = O\bigl(\|x_0 - x^\star\|^2/t^r\bigr),
\]

where the exponent $r$ is tight. This rules out the possibility of inverse quadratic convergence
of the generalized ODE and scheme for all $f \in \mathcal{F}_L$ if $r < 2$. An example with $r = 1$ is
plotted in Figure 2.
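As a numerical sanity check on the quadratic example above, the sketch below integrates (17) for the scalar objective $f(x) = x^2/2$ with the classical fourth-order Runge-Kutta method. The function name, the step size `h`, and the start time `t0` are illustrative choices of ours; the integration starts at a small `t0` from the leading terms of the power series of the solution, to sidestep the singular coefficient $r/t$ at $t = 0$. For $r = 2$ the Bessel formula reduces to $X(t) = x_0 \sin(t)/t$, which provides an exact reference value.

```python
import math


def integrate_generalized_ode(r, x0, t_end, h=1e-3, t0=1e-2):
    """RK4 for X'' + (r/t) X' + X = 0, i.e. ODE (17) with f(x) = x^2 / 2 (scalar case).

    Returns the final time reached and the value X at that time.
    """
    # Leading terms of the power series near t = 0 (X(0) = x0, X'(0) = 0).
    x = x0 * (1.0 - t0 ** 2 / (2.0 * (1.0 + r)))
    v = -x0 * t0 / (1.0 + r)
    t = t0

    def accel(t, x, v):
        return -(r / t) * v - x

    n_steps = int(round((t_end - t0) / h))
    for _ in range(n_steps):
        k1x, k1v = v, accel(t, x, v)
        k2x, k2v = v + 0.5 * h * k1v, accel(t + 0.5 * h, x + 0.5 * h * k1x, v + 0.5 * h * k1v)
        k3x, k3v = v + 0.5 * h * k2v, accel(t + 0.5 * h, x + 0.5 * h * k2x, v + 0.5 * h * k2v)
        k4x, k4v = v + h * k3v, accel(t + h, x + h * k3x, v + h * k3v)
        x += h * (k1x + 2.0 * k2x + 2.0 * k3x + k4x) / 6.0
        v += h * (k1v + 2.0 * k2v + 2.0 * k3v + k4v) / 6.0
        t += h
    return t, x


# Check against the closed form X(t) = x0 * sin(t) / t, valid for r = 2.
t_final, x_final = integrate_generalized_ode(r=2.0, x0=1.0, t_end=10.0)
reference = math.sin(t_final) / t_final
print("r = 2: RK4 value", x_final, "closed form", reference)

# Scaled error t^2 * f(X(t)) for r = 1; the text predicts that it is not bounded.
for horizon in (10.0, 20.0, 40.0):
    t, x = integrate_generalized_ode(r=1.0, x0=1.0, t_end=horizon)
    print("r = 1, t =", round(t, 3), "t^2 * f(X(t)) =", t * t * 0.5 * x * x)
```

Because the solution oscillates, the scaled error at any single time can be small; it is the envelope of the oscillation that grows when $r < 2$.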

Next, we consider the case $2 \le r < 3$ and let $f(x) = |x|$ (this also applies to multivariate
$f = \|x\|$).$^4$ Starting from $x_0 > 0$, we get $X(t) = x_0 - \frac{t^2}{2(1+r)}$ for $t \le \sqrt{2(1+r)x_0}$. Requiring
continuity of $X$ and $\dot X$ at the change point 0, we get

\[
X(t) = \frac{t^2}{2(1+r)} + \frac{2\bigl(2(1+r)x_0\bigr)^{\frac{r+1}{2}}}{(r^2-1)\,t^{r-1}} - \frac{r+3}{r-1}x_0
\]

for $\sqrt{2(1+r)x_0} < t \le \sqrt{2c^\star(1+r)x_0}$, where $c^\star$ is the positive root other than 1 of
$(r-1)c + 4c^{-\frac{r-1}{2}} = r+3$. Repeating this process solves for $X$. Note that $t^{1-r}$ is in the null
space of $\ddot X + r\dot X/t$ and satisfies $t^2 \times t^{1-r} \to \infty$ as $t \to \infty$. For illustration, Figure 4 plots
$t^2(f(X(t)) - f^\star)$ and $sk^2(f(x_k) - f^\star)$ with $r = 2, 2.5$, and $r = 4$ for comparison.$^5$ It is
clear that inverse quadratic convergence does not hold for $r = 2, 2.5$, that is, (2) does not
hold for $r < 3$. Interestingly, in Figures 4a and 4d, the scaled errors at peaks grow linearly,
whereas for $r = 2.5$, the growth rate, though positive as well, seems sublinear.

4. This function does not have a Lipschitz continuous gradient. However, a similar pattern as in Figure 2
can be also observed if we smooth $|x|$ at an arbitrarily small vicinity of 0.



[Figure 4 panels: (a) ODE (17) with $r = 2$. (b) ODE (17) with $r = 2.5$. (c) ODE (17) with $r = 4$.]

Figure 4: Scaled errors $t^2(f(X(t)) - f^\star)$ and $sk^2(f(x_k) - f^\star)$ of generalized ODEs and
schemes for minimizing $f = |x|$. In (d), the step size $s = 10^{-6}$, in (e), $s = 10^{-7}$, and in (f),
$s = 10^{-6}$.


However, if $f$ possesses some additional property, inverse quadratic convergence is still
guaranteed, as stated below. In that theorem, $f$ is assumed to be a continuously differentiable
convex function.

Theorem 7 Suppose $1 < r < 3$ and let $X$ be a solution to the ODE (17). If $(f - f^\star)^{\frac{r-1}{2}}$
is also convex, then

\[
f(X(t)) - f^\star \le \frac{(r-1)^2\|x_0 - x^\star\|^2}{2t^2}.
\]

Proof Since $(f - f^\star)^{\frac{r-1}{2}}$ is convex, we obtain

\[
\bigl(f(X(t)) - f^\star\bigr)^{\frac{r-1}{2}} \le \bigl\langle X - x^\star, \nabla (f(X) - f^\star)^{\frac{r-1}{2}}\bigr\rangle
= \frac{r-1}{2}\bigl(f(X) - f^\star\bigr)^{\frac{r-3}{2}}\langle X - x^\star, \nabla f(X)\rangle,
\]

which can be simplified to $\frac{2}{r-1}(f(X) - f^\star) \le \langle X - x^\star, \nabla f(X)\rangle$. This inequality
combined with (18) shows that $\mathcal{E}(t)$, as defined for Theorem 5, is monotonically decreasing.
This completes the proof by noting $f(X) - f^\star \le (r-1)\mathcal{E}(t)/(2t^2) \le (r-1)\mathcal{E}(0)/(2t^2) =
(r-1)^2\|x_0 - x^\star\|^2/(2t^2)$. ■
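To see what the convexity hypothesis of Theorem 7 requires in practice, it may help to test it on the two counterexamples used earlier in this subsection. The computation below is our own illustration and is not part of the original text.

```latex
\begin{align*}
% Quadratic objective, f^\star = 0:
f(x) = \tfrac12\|x\|^2: \quad & (f - f^\star)^{\frac{r-1}{2}} = 2^{-\frac{r-1}{2}}\,\|x\|^{\,r-1},
   \ \text{which is convex if and only if } r - 1 \ge 1, \text{ i.e. } r \ge 2. \\
% Absolute value, f^\star = 0:
f(x) = |x|: \quad & (f - f^\star)^{\frac{r-1}{2}} = |x|^{\frac{r-1}{2}},
   \ \text{which is convex if and only if } \tfrac{r-1}{2} \ge 1, \text{ i.e. } r \ge 3.
\end{align*}
```

So the quadratic satisfies the hypothesis exactly when $2 \le r < 3$, matching the $O(t^{-r})$ rate found above, while $|x|$ never satisfies it for $r < 3$, which is consistent with its role as a counterexample.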


5. For Figures 4d, 4e and 4f, if running generalized Nesterov's schemes with too many iterations (e.g. $10^5$),
the deviations from the ODE will grow. Taking a sufficiently small $s$ can solve this issue.

4.3 Strongly Convex $f$

Strong convexity is a desirable property for optimization. Making use of this property
carefully suggests a generalized Nesterov's scheme that achieves optimal linear convergence
(Nesterov, 2004). In that case, even vanilla gradient descent has a linear convergence rate.
Unfortunately, the example given in the previous subsection simply rules out such a possibility
for (1) and its generalizations (19). However, from a different perspective, this example
suggests that an $O(t^{-r})$ convergence rate can be expected for (17). In the next theorem, we
prove a slightly weaker statement of this kind, that is, a provable $O(t^{-\frac{2r}{3}})$ convergence rate
is established for strongly convex functions. Bridging this gap may require new tools and
more careful analysis.

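Let $f \in \mathcal{S}_{\mu,L}(\mathbb{R}^n)$ and consider a new energy functional for $\alpha > 2$ defined as

\[
\mathcal{E}(t;\alpha) = t^\alpha\bigl(f(X(t)) - f^\star\bigr) + \frac{(2r-\alpha)^2 t^{\alpha-2}}{8}\Bigl\|X(t) + \frac{2t}{2r-\alpha}\dot X - x^\star\Bigr\|^2.
\]

When clear from the context, $\mathcal{E}(t;\alpha)$ is simply denoted as $\mathcal{E}(t)$. For $r > 3$, taking $\alpha = 2r/3$
in the theorem stated below gives $f(X(t)) - f^\star \lesssim \|x_0 - x^\star\|^2/t^{\frac{2r}{3}}$.

Theorem 8 For any $f \in \mathcal{S}_{\mu,L}(\mathbb{R}^n)$, if $2 \le \alpha \le 2r/3$ we get

\[
f(X(t)) - f^\star \le \frac{C\|x_0 - x^\star\|^2}{\mu^{\frac{\alpha-2}{2}}\,t^\alpha}
\]

for any $t > 0$. Above, the constant $C$ only depends on $\alpha$ and $r$.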
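Proof Note that $\dot{\mathcal{E}}(t;\alpha)$ equals

\[
\alpha t^{\alpha-1}\bigl(f(X) - f^\star\bigr) - \frac{(2r-\alpha)t^{\alpha-1}}{2}\langle X - x^\star, \nabla f(X)\rangle + \frac{(\alpha-2)(2r-\alpha)^2 t^{\alpha-3}}{8}\|X - x^\star\|^2
+ \frac{(\alpha-2)(2r-\alpha)t^{\alpha-2}}{4}\langle \dot X, X - x^\star\rangle.
\tag{23}
\]

By the strong convexity of $f$, the second term of the right-hand side of (23) is bounded
below as

\[
\frac{(2r-\alpha)t^{\alpha-1}}{2}\langle X - x^\star, \nabla f(X)\rangle \ge \frac{(2r-\alpha)t^{\alpha-1}}{2}\bigl(f(X) - f^\star\bigr) + \frac{\mu(2r-\alpha)t^{\alpha-1}}{4}\|X - x^\star\|^2.
\]

Substituting the last display into (23) with the awareness of $r \ge 3\alpha/2$ yields

\[
\dot{\mathcal{E}} \le -\frac{\bigl(2\mu(2r-\alpha)t^2 - (\alpha-2)(2r-\alpha)^2\bigr)t^{\alpha-3}}{8}\|X - x^\star\|^2 + \frac{(\alpha-2)(2r-\alpha)t^{\alpha-2}}{8}\frac{d\|X - x^\star\|^2}{dt}.
\]

Hence, if $t \ge t_\alpha := \sqrt{(\alpha-2)(2r-\alpha)/(2\mu)}$, we obtain

\[
\dot{\mathcal{E}}(t) \le \frac{(\alpha-2)(2r-\alpha)t^{\alpha-2}}{8}\frac{d\|X - x^\star\|^2}{dt}.
\]

Integrating the last inequality on the interval $(t_\alpha, t)$ gives

\[
\begin{aligned}
\mathcal{E}(t) &\le \mathcal{E}(t_\alpha) + \frac{(\alpha-2)(2r-\alpha)t^{\alpha-2}}{8}\|X(t) - x^\star\|^2 - \frac{(\alpha-2)(2r-\alpha)t_\alpha^{\alpha-2}}{8}\|X(t_\alpha) - x^\star\|^2 \\
&\quad - \frac18\int_{t_\alpha}^{t}(\alpha-2)^2(2r-\alpha)u^{\alpha-3}\|X(u) - x^\star\|^2\,du \\
&\le \mathcal{E}(t_\alpha) + \frac{(\alpha-2)(2r-\alpha)t^{\alpha-2}}{8}\|X(t) - x^\star\|^2 \\
&\le \mathcal{E}(t_\alpha) + \frac{(\alpha-2)(2r-\alpha)t^{\alpha-2}}{4\mu}\bigl(f(X(t)) - f^\star\bigr).
\end{aligned}
\tag{24}
\]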

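Making use of (24), we apply induction on $\alpha$ to finish the proof. First, consider $2 \le
\alpha < 4$. Applying Theorem 5, from (24) we get that $\mathcal{E}(t)$ is upper bounded by

\[
\mathcal{E}(t_\alpha) + \frac{(\alpha-2)(r-1)^2(2r-\alpha)\|x_0 - x^\star\|^2}{8\mu t^{4-\alpha}} \le \mathcal{E}(t_\alpha) + \frac{(\alpha-2)(r-1)^2(2r-\alpha)\|x_0 - x^\star\|^2}{8\mu t_\alpha^{4-\alpha}}.
\tag{25}
\]

Then, we bound $\mathcal{E}(t_\alpha)$ as follows.

\[
\begin{aligned}
\mathcal{E}(t_\alpha) &\le t_\alpha^\alpha\bigl(f(X(t_\alpha)) - f^\star\bigr) + \frac{(2r-\alpha)^2 t_\alpha^{\alpha-2}}{4}\Bigl\|\frac{2r-2}{2r-\alpha}X(t_\alpha) + \frac{2t_\alpha}{2r-\alpha}\dot X(t_\alpha) - \frac{2r-2}{2r-\alpha}x^\star\Bigr\|^2 \\
&\quad + \frac{(2r-\alpha)^2 t_\alpha^{\alpha-2}}{4}\Bigl\|\frac{\alpha-2}{2r-\alpha}X(t_\alpha) - \frac{\alpha-2}{2r-\alpha}x^\star\Bigr\|^2 \\
&\le (r-1)^2 t_\alpha^{\alpha-2}\|x_0 - x^\star\|^2 + \frac{(\alpha-2)^2(r-1)^2\|x_0 - x^\star\|^2}{4\mu t_\alpha^{4-\alpha}},
\end{aligned}
\tag{26}
\]

where in the second inequality we use the decreasing property of the energy functional
defined for Theorem 5. Combining (25) and (26), we have

\[
\mathcal{E}(t) \le (r-1)^2 t_\alpha^{\alpha-2}\|x_0 - x^\star\|^2 + \frac{(\alpha-2)(r-1)^2(2r+\alpha-4)\|x_0 - x^\star\|^2}{8\mu t_\alpha^{4-\alpha}} = O\Bigl(\frac{\|x_0 - x^\star\|^2}{\mu^{\frac{\alpha-2}{2}}}\Bigr).
\]

For $t \ge t_\alpha$, it suffices to apply $f(X(t)) - f^\star \le \mathcal{E}(t)/t^\alpha$ to the last display. For $t < t_\alpha$, by
Theorem 5, $f(X(t)) - f^\star$ is upper bounded by

\[
\frac{(r-1)^2\|x_0 - x^\star\|^2}{2t^2} \le \frac{(r-1)^2\mu^{\frac{\alpha-2}{2}}\bigl[(\alpha-2)(2r-\alpha)/(2\mu)\bigr]^{\frac{\alpha-2}{2}}\|x_0 - x^\star\|^2}{2\mu^{\frac{\alpha-2}{2}}\,t^\alpha} = O\Bigl(\frac{\|x_0 - x^\star\|^2}{\mu^{\frac{\alpha-2}{2}}\,t^\alpha}\Bigr).
\tag{27}
\]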


Next, suppose that the theorem is valid for some $\tilde\alpha > 2$. We show below that this
theorem is still valid for $\alpha := \tilde\alpha + 1$ if still $r \ge 3\alpha/2$. By the assumption, (24) further
induces

\[
\mathcal{E}(t) \le \mathcal{E}(t_\alpha) + \frac{(\alpha-2)(2r-\alpha)t^{\alpha-2}}{4\mu}\cdot\frac{\tilde C\|x_0 - x^\star\|^2}{\mu^{\frac{\tilde\alpha-2}{2}}\,t^{\tilde\alpha}} \le \mathcal{E}(t_\alpha) + \frac{\tilde C(\alpha-2)(2r-\alpha)\|x_0 - x^\star\|^2}{4\mu^{\frac{\alpha-1}{2}}\,t_\alpha}
\]

for some constant $\tilde C$ only depending on $\tilde\alpha$ and $r$. This inequality with (26) implies

\[
\mathcal{E}(t) \le (r-1)^2 t_\alpha^{\alpha-2}\|x_0 - x^\star\|^2 + \frac{(\alpha-2)^2(r-1)^2\|x_0 - x^\star\|^2}{4\mu t_\alpha^{4-\alpha}} + \frac{\tilde C(\alpha-2)(2r-\alpha)\|x_0 - x^\star\|^2}{4\mu^{\frac{\alpha-1}{2}}\,t_\alpha}
= O\bigl(\|x_0 - x^\star\|^2/\mu^{\frac{\alpha-2}{2}}\bigr),
\]

which verifies the induction for $t \ge t_\alpha$. As for $t < t_\alpha$, the validity of the induction follows
from Theorem 5, similarly to (27). Thus, combining the base and induction steps, the proof
is completed. ■


It should be pointed out that the constant C in the statement of Theorem 8 grows with 
the parameter r. Hence, simply increasing r does not guarantee to give a better error bound. 
While it is desirable to expect a discrete analog of Theorem 8, i.e., an $O(1/k^\alpha)$ convergence
rate for (19), a complete proof can be notoriously complicated. That said, we mimic the
proof of Theorem 8 for $\alpha = 3$ and succeed in obtaining an $O(1/k^3)$ convergence rate for the
generalized Nesterov's schemes, as summarized in the theorem below.

Theorem 9 Suppose $f$ is written as $f = g + h$, where $g \in \mathcal{S}_{\mu,L}$ and $h$ is convex with possible
extended value $\infty$. Then, the generalized Nesterov's scheme (19) with $r \ge 9/2$ and $s = 1/L$
satisfies

\[
f(x_k) - f^\star \le \frac{CL\|x_0 - x^\star\|^2}{k^2}\cdot\frac{\sqrt{L/\mu}}{k},
\]

where $C$ only depends on $r$.

This theorem states that the discrete scheme (19) enjoys the error bound $O(1/k^3)$ without
any knowledge of the condition number $L/\mu$. In particular, this bound is much better
than that given in Theorem 6 if $k \gg \sqrt{L/\mu}$. The strategy of the proof is fully inspired by
that of Theorem 8, though it is much more complicated and thus deferred to the Appendix.
The relevant energy functional $\mathcal{E}(k)$ for Theorem 9 is equal to

\[
\frac{s(2k+3r-5)(2k+2r-5)(4k+4r-9)}{16}\bigl(f(x_k) - f^\star\bigr)
+ \frac{2k+3r-5}{16}\bigl\|2(k+r-1)y_k - (2k+1)x_k - (2r-3)x^\star\bigr\|^2.
\tag{28}
\]

16 

4.4 Numerical Examples 

We study six synthetic examples to compare (19) across different values of $r$, with the step sizes
fixed to be $1/L$, as illustrated in Figure 5. The error rates exhibit similar patterns for all $r$,
namely, decreasing while suffering from local bumps. A smaller $r$ introduces less friction, thus
allowing $x_k$ to move towards $x^\star$ faster in the beginning. However, when sufficiently close to $x^\star$,
more friction is preferred in order to reduce overshoot. This point of view explains what we observe in
these examples. That is, across these six examples, (19) with a smaller $r$ performs slightly
better in the beginning, but a larger $r$ has the advantage when $k$ is large. It is an interesting
question how to choose a good $r$ for different problems in practice.

[Figure 5 panels (horizontal axis: iterations): (b) Lasso with square design. (c) NLS with fat design. (d) NLS with square design. (e) Logistic regression. (f) $\ell_1$-regularized logistic regression.]

Figure 5: Comparisons of generalized Nesterov's schemes with different $r$.


Lasso with fat design. Minimize $f(x) = \frac12\|Ax - b\|^2 + \lambda\|x\|_1$, in which $A$ is a $100 \times 500$
random matrix with i.i.d. standard Gaussian $\mathcal{N}(0,1)$ entries, $b$ generated independently has
i.i.d. $\mathcal{N}(0,25)$ entries, and the penalty $\lambda = 4$. The plot is Figure 5a.

Lasso with square design. Minimize $f(x) = \frac12\|Ax - b\|^2 + \lambda\|x\|_1$, where $A$ is a $500 \times
500$ random matrix with i.i.d. standard Gaussian entries, $b$ generated independently has
i.i.d. $\mathcal{N}(0,9)$ entries, and the penalty $\lambda = 4$. The plot is Figure 5b.

Nonnegative least squares (NLS) with fat design. Minimize $f(x) = \|Ax - b\|^2$
subject to $x \succeq 0$, with the same design $A$ and $b$ as in Figure 5a. The plot is Figure 5c.




Nonnegative least squares with sparse design. Minimize $f(x) = \|Ax - b\|^2$ subject
to $x \succeq 0$, in which $A$ is a $1000 \times 10000$ sparse matrix with nonzero probability 10% for each
entry and $b$ is given as $b = Ax^0 + \mathcal{N}(0, I_{1000})$. The nonzero entries of $A$ are independently
Gaussian distributed before column normalization, and $x^0$ has 100 nonzero entries that are
all equal to 4. The plot is Figure 5d.

Logistic regression. Minimize $\sum_{i=1}^n -y_i a_i^T x + \log(1 + e^{a_i^T x})$, in which $A = (a_1, \ldots, a_n)^T$
is a $500 \times 100$ matrix with i.i.d. $\mathcal{N}(0,1)$ entries. The labels $y_i \in \{0,1\}$ are generated by the
logistic model: $P(Y_i = 1) = 1/(1 + e^{-a_i^T x^0})$, where $x^0$ is a realization of i.i.d. $\mathcal{N}(0, 1/100)$.
The plot is Figure 5e.

$\ell_1$-regularized logistic regression. Minimize $\sum_{i=1}^n -y_i a_i^T x + \log(1 + e^{a_i^T x}) + \lambda\|x\|_1$, in
which $A = (a_1, \ldots, a_n)^T$ is a $200 \times 1000$ matrix with i.i.d. $\mathcal{N}(0,1)$ entries and $\lambda = 5$. The
labels $y_i$ are generated similarly as in the previous example, except for the ground truth $x^0$
here having 10 nonzero components given as i.i.d. $\mathcal{N}(0,225)$. The plot is Figure 5f.


5. Restarting 

The example discussed in Section 4.2 demonstrates that Nesterov's scheme and its generalizations
(19) are not capable of fully exploiting strong convexity. That is, this example
provides evidence that $O(1/\mathrm{poly}(k))$ is the best rate achievable under strong convexity. In
contrast, the vanilla gradient method achieves linear convergence $O((1 - \mu/L)^k)$. This drawback
results from too much momentum being introduced when the objective function is strongly
convex. The derivative of a strongly convex function is generally more reliable than that
of non-strongly convex functions. In the language of ODEs, at a later stage a too small $3/t$
in (3) leads to a lack of friction, resulting in unnecessary overshoot along the trajectory.

Incorporating the optimal momentum coefficient $\frac{\sqrt{L} - \sqrt{\mu}}{\sqrt{L} + \sqrt{\mu}}$ (this is less than $(k-1)/(k+2)$
when $k$ is large), Nesterov's scheme has a convergence rate of $O((1 - \sqrt{\mu/L})^k)$ (Nesterov,
2004), which, however, requires knowledge of the condition number $\mu/L$. While it is relatively
easy to bound the Lipschitz constant $L$ by the use of backtracking, estimating the
strong convexity parameter $\mu$, if not impossible, is very challenging.

Among many approaches to gain acceleration via adaptively estimating $\mu/L$ (see Nesterov,
2013), O'Donoghue and Candes (2013) propose a procedure termed gradient
restarting for Nesterov's scheme in which (1) is restarted with $x_0 = y_0 := x_k$ whenever
$f(x_{k+1}) > f(x_k)$. In the language of ODEs, this restarting essentially keeps $\langle \nabla f, \dot X\rangle$ negative,
and resets $3/t$ each time to prevent this coefficient from steadily decreasing along the
trajectory. Although it has been empirically observed that this method significantly boosts
convergence, there is no general theory characterizing the convergence rate.

In this section, we propose a new restarting scheme we call the speed restarting scheme. 
The underlying motivation is to maintain a relatively high velocity X along the trajectory, 
similar in spirit to the gradient restarting. Specifically, our main result, Theorem 10, ensures 
linear convergence of the continuous version of the speed restarting. More generally, our 
contribution here is merely to provide a framework for analyzing restarting schemes rather 
than competing with other schemes; it is beyond the scope of this paper to get optimal 
constants in these results. Throughout this section, we assume $f \in \mathcal{S}_{\mu,L}$ for some $0 < \mu \le L$.
Recall that a function $f \in \mathcal{S}_{\mu,L}$ if $f \in \mathcal{F}_L$ and $f(x) - \mu\|x\|^2/2$ is convex.

5.1 A New Restarting Scheme

We first define the speed restarting time. For the ODE (3), we call

\[
T = T(x_0; f) = \sup\Bigl\{ t > 0 : \forall u \in (0,t),\ \frac{d\|\dot X(u)\|^2}{du} > 0 \Bigr\}
\]

the speed restarting time. In words, $T$ is the first time the velocity $\|\dot X\|$ decreases. Back to
the discrete scheme, it is the first time when we observe $\|x_{k+1} - x_k\| < \|x_k - x_{k-1}\|$. This
definition itself does not directly imply that $0 < T < \infty$, which is proven later in Lemmas
13 and 25. Indeed, $f(X(t))$ is a decreasing function before time $T$; for $t \le T$,

\[
\frac{df(X(t))}{dt} = \langle \nabla f(X), \dot X\rangle = -\frac{3}{t}\|\dot X\|^2 - \frac12\frac{d\|\dot X\|^2}{dt} \le 0.
\]

The speed restarted ODE is thus

\[
\ddot X(t) + \frac{3}{t_{sr}}\dot X(t) + \nabla f(X(t)) = 0,
\tag{29}
\]

where $t_{sr}$ is set to zero whenever $\langle \dot X, \ddot X\rangle = 0$ and between two consecutive restarts, $t_{sr}$ grows
just as $t$. That is, $t_{sr} = t - \tau$, where $\tau$ is the latest restart time. In particular, $t_{sr} = 0$ at
$t = 0$. Letting $X^{sr}$ be the solution to (29), we have the following observations.

• $X^{sr}(t)$ is continuous for $t \ge 0$, with $X^{sr}(0) = x_0$;

• $X^{sr}(t)$ satisfies (3) for $0 < t < T_1 := T(x_0; f)$.

• Recursively define $T_{i+1} = T\bigl(X^{sr}\bigl(\sum_{j=1}^{i} T_j\bigr); f\bigr)$ for $i \ge 1$, and $\tilde X(t) := X^{sr}\bigl(\sum_{j=1}^{i} T_j + t\bigr)$
satisfies the ODE (3), with $\tilde X(0) = X^{sr}\bigl(\sum_{j=1}^{i} T_j\bigr)$, for $0 < t < T_{i+1}$.

The theorem below guarantees linear convergence of $X^{sr}$. This is a new result in the
literature (O'Donoghue and Candes, 2013; Monteiro et al., 2012). The proof of Theorem 10
is based on Lemmas 12 and 13, where the first guarantees that the error $f(X^{sr}) - f^\star$ decays by a
constant factor for each restarting, and the second confirms that restartings are adequate.
In these lemmas we adopt the convention that the uninteresting case $x_0 = x^\star$ is excluded.

Theorem 10 There exist positive constants $c_1$ and $c_2$, which only depend on the condition
number $L/\mu$, such that for any $f \in \mathcal{S}_{\mu,L}$, we have

\[
f(X^{sr}(t)) - f^\star \le \frac{c_1 L\|x_0 - x^\star\|^2}{2}\,e^{-c_2 t\sqrt{L}}.
\]

Before turning to the proof, we make a remark that this linear convergence of $X^{sr}$
remains valid for the generalized ODE (17) with $r > 3$. Only minor modifications in the
proof below are needed, such as replacing $u^3$ by $u^r$ in the definition of $I(t)$ in Lemma 25.

5.2 Proof of Linear Convergence

First, we collect some useful estimates. Denote by $M(t)$ the supremum of $\|\dot X(u)\|/u$ over
$u \in (0,t]$ and let

\[
I(t) := \int_0^t u^3\bigl(\nabla f(X(u)) - \nabla f(x_0)\bigr)\,du.
\]

It is guaranteed that $M$ defined above is finite, for example, see the proof of Lemma 18.
The definition of $M$ gives a bound on the gradient of $f$,

\[
\|\nabla f(X(t)) - \nabla f(x_0)\| \le L\Bigl\|\int_0^t \dot X(u)\,du\Bigr\| \le L\int_0^t u\,\frac{\|\dot X(u)\|}{u}\,du \le \frac{LM(t)t^2}{2}.
\]

Hence, it is easy to see that $I$ can also be bounded via $M$,

\[
\|I(t)\| \le \int_0^t u^3\|\nabla f(X(u)) - \nabla f(x_0)\|\,du \le \int_0^t \frac{LM(u)u^5}{2}\,du \le \frac{LM(t)t^6}{12}.
\]

To fully facilitate these estimates, we need the following lemma that gives an upper bound
of $M$, whose proof is deferred to the appendix.

Lemma 11 For $t < \sqrt{12/L}$, we have

\[
M(t) \le \frac{\|\nabla f(x_0)\|}{4(1 - Lt^2/12)}.
\]

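Next we give a lemma which claims that the objective function decays by a constant factor
through each speed restarting.

Lemma 12 There is a universal constant $C > 0$ such that

\[
f(X(T)) - f^\star \le \Bigl(1 - \frac{C\mu}{L}\Bigr)\bigl(f(x_0) - f^\star\bigr).
\]

Proof By Lemma 11, for $t < \sqrt{12/L}$ we have

\[
\Bigl\|\dot X(t) + \frac{t}{4}\nabla f(x_0)\Bigr\| = \frac{1}{t^3}\|I(t)\| \le \frac{LM(t)t^3}{12} \le \frac{L\|\nabla f(x_0)\|t^3}{48(1 - Lt^2/12)},
\]

which yields

\[
0 \le \frac{t}{4}\|\nabla f(x_0)\| - \frac{L\|\nabla f(x_0)\|t^3}{48(1 - Lt^2/12)} \le \|\dot X(t)\| \le \frac{t}{4}\|\nabla f(x_0)\| + \frac{L\|\nabla f(x_0)\|t^3}{48(1 - Lt^2/12)}.
\tag{30}
\]

Hence, for $0 < t < 4/(5\sqrt{L})$ we get

\[
\frac{df(X)}{dt} = -\frac{3}{t}\|\dot X\|^2 - \frac12\frac{d}{dt}\|\dot X\|^2 \le -\frac{3}{t}\|\dot X\|^2
\le -\frac{3}{t}\Bigl(\frac{t}{4}\|\nabla f(x_0)\| - \frac{L\|\nabla f(x_0)\|t^3}{48(1 - Lt^2/12)}\Bigr)^2 \le -C_1 t\|\nabla f(x_0)\|^2,
\]

where $C_1 > 0$ is an absolute constant and the second inequality follows from Lemma 25 in
the appendix. Consequently,

\[
f\bigl(X(4/(5\sqrt{L}))\bigr) - f(x_0) \le \int_0^{\frac{4}{5\sqrt{L}}} -C_1 u\|\nabla f(x_0)\|^2\,du \le -\frac{C\mu}{L}\bigl(f(x_0) - f^\star\bigr),
\]

where $C = 16C_1/25$ and in the last inequality we use the $\mu$-strong convexity of $f$. Thus we
have

\[
f\bigl(X(4/(5\sqrt{L}))\bigr) - f^\star \le \Bigl(1 - \frac{C\mu}{L}\Bigr)\bigl(f(x_0) - f^\star\bigr).
\]

To complete the proof, note that $f(X(T)) \le f(X(4/(5\sqrt{L})))$ by Lemma 25.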


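With each restarting reducing the error $f - f^\star$ by a constant factor, we still need the
following lemma to ensure sufficiently many restartings.

Lemma 13 There is a universal constant $\tilde C$ such that

\[
T \le \frac{4\exp(\tilde C L/\mu)}{5\sqrt{L}}.
\]

Proof For $4/(5\sqrt{L}) \le t \le T$, we have $\frac{df(X)}{dt} \le -\frac{3}{t}\|\dot X(t)\|^2 \le -\frac{3}{t}\|\dot X(4/(5\sqrt{L}))\|^2$, which
implies

\[
f(X(T)) - f(x_0) \le -\int_{\frac{4}{5\sqrt{L}}}^{T} \frac{3}{t}\|\dot X(4/(5\sqrt{L}))\|^2\,dt = -3\|\dot X(4/(5\sqrt{L}))\|^2 \log\frac{5T\sqrt{L}}{4}.
\]

Hence, we get an upper bound for $T$,

\[
T \le \frac{4}{5\sqrt{L}}\exp\Bigl(\frac{f(x_0) - f(X(T))}{3\|\dot X(4/(5\sqrt{L}))\|^2}\Bigr) \le \frac{4}{5\sqrt{L}}\exp\Bigl(\frac{f(x_0) - f^\star}{3\|\dot X(4/(5\sqrt{L}))\|^2}\Bigr).
\]

Plugging $t = 4/(5\sqrt{L})$ into (30) gives $\|\dot X(4/(5\sqrt{L}))\| \ge \frac{C_1}{\sqrt{L}}\|\nabla f(x_0)\|$ for some universal
constant $C_1 > 0$. Hence, from the last display we get

\[
T \le \frac{4}{5\sqrt{L}}\exp\Bigl(\frac{L(f(x_0) - f^\star)}{3C_1^2\|\nabla f(x_0)\|^2}\Bigr) \le \frac{4}{5\sqrt{L}}\exp\Bigl(\frac{L}{6C_1^2\mu}\Bigr).
\]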

Now, we are ready to prove Theorem 10 by applying Lemmas 12 and 13. 

Proof Note that Lemma 13 asserts that, by time $t$, at least $m := \lfloor 5t\sqrt{L}\,e^{-\tilde C L/\mu}/4 \rfloor$ restartings
have occurred for $X^{sr}$. Hence, recursively applying Lemma 12, we have

\[
\begin{aligned}
f(X^{sr}(t)) - f^\star &\le f\bigl(X^{sr}(T_1 + \cdots + T_m)\bigr) - f^\star \\
&\le (1 - C\mu/L)\bigl(f\bigl(X^{sr}(T_1 + \cdots + T_{m-1})\bigr) - f^\star\bigr) \\
&\le \cdots \\
&\le (1 - C\mu/L)^m\bigl(f(x_0) - f^\star\bigr) \le e^{-C\mu m/L}\bigl(f(x_0) - f^\star\bigr) \\
&\le c_1 e^{-c_2 t\sqrt{L}}\bigl(f(x_0) - f^\star\bigr) \le \frac{c_1 L\|x_0 - x^\star\|^2}{2}\,e^{-c_2 t\sqrt{L}},
\end{aligned}
\]

where $c_1 = \exp(C\mu/L)$ and $c_2 = 5C\mu e^{-\tilde C L/\mu}/(4L)$.

In closing, we remark that we believe that the estimate in Lemma 12 is tight, while that in
Lemma 13 is not. Thus we conjecture that for a large class of $f \in \mathcal{S}_{\mu,L}$, if not all, $T = O(\sqrt{L}/\mu)$.
If this is true, the exponent constant $c_2$ in Theorem 10 can be significantly improved.

5.3 Numerical Examples 

Below we present a discrete analog to the restarted scheme. There, $k_{\min}$ is introduced to
avoid having consecutive restarts that are too close. To compare the performance of the
restarted scheme with the original (1), we conduct four simulation studies, including both 
smooth and non-smooth objective functions. Note that the computational costs of the 
restarted and non-restarted schemes are the same. 


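Algorithm 1 Speed Restarting Nesterov's Scheme

input: $x_0 \in \mathbb{R}^n$, $y_0 = x_0$, $x_{-1} = x_0$, $0 < s \le 1/L$, $k_{\max} \in \mathbb{N}^+$ and $k_{\min} \in \mathbb{N}^+$
$j \leftarrow 1$
for $k = 1$ to $k_{\max}$ do
    $x_k \leftarrow \operatorname{argmin}_x \bigl( \frac{1}{2s}\|x - y_{k-1} + s\nabla g(y_{k-1})\|^2 + h(x) \bigr)$
    $y_k \leftarrow x_k + \frac{j-1}{j+2}(x_k - x_{k-1})$
    if $\|x_k - x_{k-1}\| < \|x_{k-1} - x_{k-2}\|$ and $j \ge k_{\min}$ then
        $j \leftarrow 1$
    else
        $j \leftarrow j + 1$
    end if
end for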
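The steps of Algorithm 1 can be sketched in pure Python as follows. The function name `speed_restart_nesterov`, the helper `euclidean_distance`, and the small diagonal quadratic used as a test problem (with $h = 0$, so that the proximal map is the identity) are our own illustrative assumptions; only the update rules and the restart test follow the algorithm as stated.

```python
import math


def euclidean_distance(u, v):
    return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))


def speed_restart_nesterov(grad_g, prox_h, x0, s, k_max, k_min):
    """Run Algorithm 1 and return the last iterate and the list of restart indices."""
    x_prev2 = list(x0)     # x_{k-2}; starts as x_{-1} = x_0
    x_prev = list(x0)      # x_{k-1}; starts as x_0
    y = list(x0)           # y_{k-1}; starts as y_0 = x_0
    j = 1
    restarts = []
    for k in range(1, k_max + 1):
        grad = grad_g(y)
        # Proximal gradient step from y_{k-1}.
        x = prox_h([yi - s * gi for yi, gi in zip(y, grad)], s)
        momentum = (j - 1) / (j + 2)
        y = [xi + momentum * (xi - xpi) for xi, xpi in zip(x, x_prev)]
        # Speed restart test: the step length has just decreased.
        if euclidean_distance(x, x_prev) < euclidean_distance(x_prev, x_prev2) and j >= k_min:
            j = 1
            restarts.append(k)
        else:
            j += 1
        x_prev2, x_prev = x_prev, x
    return x_prev, restarts


# Hypothetical strongly convex test problem: f(x) = 0.5 * sum(a_i * x_i^2), minimiser x* = 0.
a = [0.1, 0.5, 1.0]


def f(x):
    return 0.5 * sum(ai * xi * xi for ai, xi in zip(a, x))


def grad_g(x):
    return [ai * xi for ai, xi in zip(a, x)]


def prox_identity(v, s):
    # h = 0, so the proximal map is the identity.
    return v


x_start = [1.0, 1.0, 1.0]
k_min = 10
x_final, restarts = speed_restart_nesterov(
    grad_g, prox_identity, x_start, s=1.0 / max(a), k_max=200, k_min=k_min
)
print("objective at start:", f(x_start), "objective at end:", f(x_final))
print("restart iterations:", restarts)
```

By construction, two consecutive restarts are always at least `k_min` iterations apart, which is the role the text assigns to $k_{\min}$.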


Quadratic. $f(x) = \frac12 x^T Ax + b^T x$ is a strongly convex function, in which $A$ is a $500 \times 500$
random positive definite matrix and $b$ a random vector. The eigenvalues of $A$ are between
0.001 and 1. The vector $b$ is generated as i.i.d. Gaussian random variables with mean 0 and
variance 25.

Log-sum-exp.

\[
f(x) = \rho\log\Bigl[\sum_{i=1}^{m}\exp\bigl((a_i^T x - b_i)/\rho\bigr)\Bigr],
\]

where $n = 50$, $m = 200$, $\rho = 20$. The matrix $A = (a_{ij})$ is a random matrix with i.i.d. standard
Gaussian entries, and $b = (b_i)$ has i.i.d. Gaussian entries with mean 0 and variance 2.
This function is not strongly convex.

Matrix completion. $f(X) = \frac12\|X_{\mathrm{obs}} - M_{\mathrm{obs}}\|_F^2 + \lambda\|X\|_*$, in which the ground truth $M$ is
a rank-5 random matrix of size $300 \times 300$. The regularization parameter is set to $\lambda = 0.05$.
The 5 singular values of $M$ are $1, \ldots, 5$. The observed set is independently sampled among
the $300 \times 300$ entries so that 10% of the entries are actually observed.

Lasso in $\ell_1$-constrained form with large sparse design. $f(x) = \frac12\|Ax - b\|^2$ s.t. $\|x\|_1 \le
\delta$, where $A$ is a $5000 \times 50000$ random sparse matrix with nonzero probability 0.5% for each
entry and $b$ is generated as $b = Ax^0 + z$. The nonzero entries of $A$ independently follow the
Gaussian distribution with mean 0 and variance 0.04. The signal $x^0$ is a vector with 250
nonzeros and $z$ is i.i.d. standard Gaussian noise. The parameter $\delta$ is set to $\|x^0\|_1$.

[Figure 6 panels (horizontal axis: iterations): (a) $\min \frac12 x^T Ax + bx$. (e) $\min \frac12\|Ax - b\|^2 + \ldots$ (f) $\min \frac12\|Ax - b\|^2 + \lambda\|x\|_1$. (g) $\min \sum_{i=1}^n -y_i a_i^T x + \log(1 + e^{a_i^T x}) + \lambda\|x\|_1$. (h) $\min \sum_{i=1}^n -y_i a_i^T x + \log(1 + e^{a_i^T x})$.]

Figure 6: Numerical performance of speed restarting (srN), gradient restarting (grN), the
original Nesterov's scheme (oN) and the proximal gradient (PG).

Sorted $\ell_1$ penalized estimation. $f(x) = \frac12\|Ax - b\|^2 + \sum_{i=1}^{p}\lambda_i|x|_{(i)}$, where $|x|_{(1)} \ge \cdots \ge
|x|_{(p)}$ are the order statistics of $|x|$. This is a recently introduced testing and estimation
procedure (Bogdan et al., 2015). The design $A$ is a $1000 \times 10000$ Gaussian random matrix,
and $b$ is generated as $b = Ax^0 + z$ for 20-sparse $x^0$ and Gaussian noise $z$. The penalty
sequence is set to $\lambda_i = 1.1\,\Phi^{-1}(1 - 0.05i/(2p))$.

Lasso. $f(x) = \frac12\|Ax - b\|^2 + \lambda\|x\|_1$, where $A$ is a $1000 \times 500$ random matrix and $b$ is given
as $b = Ax^0 + z$ for 20-sparse $x^0$ and Gaussian noise $z$. We set $\lambda = 1.5\sqrt{2\log p}$.

$\ell_1$-regularized logistic regression. $f(x) = \sum_{i=1}^n -y_i a_i^T x + \log(1 + e^{a_i^T x}) + \lambda\|x\|_1$, where
the setting is the same as in Figure 5f. The results are presented in Figure 6g.

Logistic regression with large sparse design. $f(x) = \sum_{i=1}^n -y_i a_i^T x + \log(1 + e^{a_i^T x})$,
in which $A = (a_1, \ldots, a_n)^T$ is a $10^7 \times 20000$ sparse random matrix with nonzero probability
0.1% for each entry, so there are roughly $2 \times 10^8$ nonzero entries in total. To generate the
labels $y$, we set $x^0$ to be i.i.d. $\mathcal{N}(0, 1/4)$. The plot is Figure 6h.

In these examples, $k_{\min}$ is set to be 10 and the step sizes are fixed to be $1/L$. If the
objective is in composite form, the Lipschitz bound applies to the smooth part. Figure 6
presents the performance of the speed restarting scheme, the gradient restarting scheme,
the original Nesterov's scheme and the proximal gradient method. The objective functions
include strongly convex, non-strongly convex and non-smooth functions, violating the
assumptions in Theorem 10. Among all the examples, it is interesting to note that both
restarting schemes empirically exhibit linear convergence by significantly reducing bumps in
the objective values. This leaves us an open problem of whether there exists a provable
linear convergence rate for the gradient restarting scheme as in Theorem 10. It is also
worth pointing out that compared with gradient restarting, the speed restarting scheme
empirically exhibits a more stable linear convergence rate.

6. Discussion 

This paper introduces a second-order ODE and accompanying tools for characterizing
Nesterov's accelerated gradient method. This ODE is applied to study variants of Nesterov's
scheme and is capable of interpreting some empirically observed phenomena, such as
oscillations along the trajectories. Our approach suggests (1) a large family of generalized
Nesterov's schemes that are all guaranteed to converge at the rate $O(1/k^2)$, and (2) a restarting
scheme provably achieving a linear convergence rate whenever $f$ is strongly convex.

In this paper, we often utilize ideas from continuous-time ODEs, and then apply these 
ideas to discrete schemes. The translation, however, involves parameter tuning and tedious 
calculations. This is the reason why a general theory mapping properties of ODEs into 
corresponding properties for discrete updates would be a welcome advance. Indeed, this 
would allow researchers to only study the simpler and more user-friendly ODEs. 

As evidenced by many examples, the viewpoint of regarding the ODE as a surrogate 
for Nesterov’s scheme would allow a new perspective for studying accelerated methods 
in optimization. The discrete scheme and the ODE are closely connected by the exact
mapping between the coefficients of momentum (e.g. $(k-1)/(k+2)$) and velocity (e.g. $3/t$).
The derivations of generalized Nesterov’s schemes and the speed restarting scheme are 
both motivated by trying a different velocity coefficient, in which the surprising phase 
transition at 3 is observed. Clearly, such alternatives are endless, and we expect this will 
lead to findings of many discrete accelerated schemes. In a different direction, a better 
understanding of the trajectory of the ODEs, such as curvature, has the potential to be 
helpful in deriving appropriate stopping criteria for termination, and choosing step size by 
backtracking.

Appendix A. Proof of Theorem 1

The proof is divided into two parts, namely, existence and uniqueness.

Lemma 14 For any $f \in \mathcal{F}_\infty$ and any $x_0 \in \mathbb{R}^n$, the ODE (3) has at least one solution $X$
in $C^2(0,\infty) \cap C^1[0,\infty)$.

Below, some preparatory lemmas are given before turning to the proof of this lemma. To
begin with, for any $\delta > 0$ consider the smoothed ODE

\[
\ddot X + \frac{3}{\max(\delta, t)}\dot X + \nabla f(X) = 0
\tag{31}
\]

with $X(0) = x_0$, $\dot X(0) = 0$. Denoting $Z = \dot X$, (31) is equivalent to

\[
\frac{d}{dt}\begin{pmatrix} X \\ Z \end{pmatrix} = \begin{pmatrix} Z \\ -\frac{3}{\max(\delta,t)}Z - \nabla f(X) \end{pmatrix}
\]

with $X(0) = x_0$, $Z(0) = 0$. As functions of $(X, Z)$, both $Z$ and $-3Z/\max(\delta,t) - \nabla f(X)$ are
$(\max(1,L) + 3/\delta)$-Lipschitz continuous. Hence by standard ODE theory, (31) has a unique
global solution in $C^2[0,\infty)$, denoted by $X_\delta$. Note that $\ddot X_\delta$ is also well defined at $t = 0$.
Next, introduce $M_\delta(t)$ to be the supremum of $\|\dot X_\delta(u)\|/u$ over $u \in (0,t]$. It is easy to see
that $M_\delta(t)$ is finite because $\|\dot X_\delta(u)\|/u = \|\dot X_\delta(u) - \dot X_\delta(0)\|/u = \|\ddot X_\delta(0)\| + o(1)$ for small
$u$. We give an upper bound for $M_\delta(t)$ in the following lemma.

Lemma 15 For $\delta < \sqrt{6/L}$, we have

\[
M_\delta(\delta) \le \frac{\|\nabla f(x_0)\|}{1 - L\delta^2/6}.
\]

The proof of Lemma 15 relies on a simple lemma.

Lemma 16 For any $u > 0$, the following inequality holds

\[
\|\nabla f(X_\delta(u)) - \nabla f(x_0)\| \le \frac12 LM_\delta(u)u^2.
\]

Proof By Lipschitz continuity,

\[
\|\nabla f(X_\delta(u)) - \nabla f(x_0)\| \le L\|X_\delta(u) - x_0\| = L\Bigl\|\int_0^u \dot X_\delta(v)\,dv\Bigr\| \le L\int_0^u v\,\frac{\|\dot X_\delta(v)\|}{v}\,dv \le \frac12 LM_\delta(u)u^2.
\]


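Next, we prove Lemma 15.

Proof For $0 < t \le \delta$, the smoothed ODE takes the form

\[
\ddot X_\delta + \frac{3}{\delta}\dot X_\delta + \nabla f(X_\delta) = 0,
\]

which yields

\[
\dot X_\delta e^{3t/\delta} = -\int_0^t \nabla f(X_\delta(u))e^{3u/\delta}\,du = -\nabla f(x_0)\int_0^t e^{3u/\delta}\,du - \int_0^t \bigl(\nabla f(X_\delta(u)) - \nabla f(x_0)\bigr)e^{3u/\delta}\,du.
\]

Hence, by Lemma 16

\[
\frac{\|\dot X_\delta(t)\|}{t} \le \frac{1}{t}e^{-3t/\delta}\|\nabla f(x_0)\|\int_0^t e^{3u/\delta}\,du + \frac{1}{t}e^{-3t/\delta}\int_0^t \frac12 LM_\delta(u)u^2 e^{3u/\delta}\,du
\le \|\nabla f(x_0)\| + \frac{LM_\delta(\delta)\delta^2}{6}.
\]

Taking the supremum of $\|\dot X_\delta(t)\|/t$ over $0 < t \le \delta$ and rearranging the inequality give the
desired result. ■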

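Next, we give an upper bound for $M_\delta(t)$ when $t > \delta$.

Lemma 17 For $\delta < \sqrt{6/L}$ and $\delta < t < \sqrt{12/L}$, we have

$$M_\delta(t) \le \frac{(5 - L\delta^2/6)\|\nabla f(x_0)\|}{4(1 - L\delta^2/6)(1 - Lt^2/12)}.$$

Proof For $t > \delta$, the smoothed ODE takes the form

$$\ddot X_\delta + \frac{3}{t}\dot X_\delta + \nabla f(X_\delta) = 0,$$

which is equivalent to

$$\frac{d\left(t^3 \dot X_\delta(t)\right)}{dt} = -t^3 \nabla f(X_\delta(t)).$$

Hence, by integration, $t^3 \dot X_\delta(t)$ is equal to

$$-\int_\delta^t u^3 \nabla f(X_\delta(u))\,du + \delta^3 \dot X_\delta(\delta) = -\int_\delta^t u^3 \nabla f(x_0)\,du - \int_\delta^t u^3\left(\nabla f(X_\delta(u)) - \nabla f(x_0)\right)du + \delta^3 \dot X_\delta(\delta).$$

Therefore by Lemmas 16 and 15, we get

$$\begin{aligned}
\frac{\|\dot X_\delta(t)\|}{t} &\le \frac{t^4 - \delta^4}{4t^4}\|\nabla f(x_0)\| + \frac{1}{t^4}\int_\delta^t \frac{1}{2} L M_\delta(u) u^5\,du + \frac{\delta^4}{t^4}\frac{\|\dot X_\delta(\delta)\|}{\delta} \\
&\le \frac{1}{4}\|\nabla f(x_0)\| + \frac{1}{12} L M_\delta(t) t^2 + \frac{\|\nabla f(x_0)\|}{1 - L\delta^2/6},
\end{aligned}$$

where the last expression is an increasing function of $t$. So for any $\delta < t' < t$, it follows that

$$\frac{\|\dot X_\delta(t')\|}{t'} \le \frac{1}{4}\|\nabla f(x_0)\| + \frac{1}{12} L M_\delta(t) t^2 + \frac{\|\nabla f(x_0)\|}{1 - L\delta^2/6},$$

which also holds for $t' \le \delta$. Taking the supremum over $t' \in (0, t)$ gives

$$M_\delta(t) \le \frac{1}{4}\|\nabla f(x_0)\| + \frac{1}{12} L M_\delta(t) t^2 + \frac{\|\nabla f(x_0)\|}{1 - L\delta^2/6}.$$

The desired result follows from rearranging the inequality.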
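The bounds of Lemmas 15 and 17 can be illustrated on a toy problem. The sketch below approximates $M_\delta(t)$ with a plain forward Euler discretization of the smoothed ODE for the assumed scalar objective $f(x) = Lx^2/2$ and compares it with the two bounds. The function names and the parameter values are hypothetical choices for this example, and a discretized supremum is only an approximation of the continuous one.

```python
def velocity_ratio_sup(L, x0, delta, t_end, h):
    """Approximate M_delta(t_end) = sup over (0, t_end] of |X'(u)| / u
    for f(x) = L x^2 / 2, using forward Euler on the smoothed ODE (31)."""
    x, z, sup_ratio = x0, 0.0, 0.0
    for k in range(int(round(t_end / h))):
        t = k * h
        # Simultaneous update: both right-hand sides use the old (x, z).
        x, z = x + h * z, z + h * (-3.0 * z / max(delta, t) - L * x)
        sup_ratio = max(sup_ratio, abs(z) / ((k + 1) * h))
    return sup_ratio


def lemma15_bound(L, grad0, delta):
    """Right-hand side of Lemma 15, with grad0 = |grad f(x0)|."""
    return grad0 / (1.0 - L * delta ** 2 / 6.0)


def lemma17_bound(L, grad0, delta, t):
    """Right-hand side of Lemma 17, with grad0 = |grad f(x0)|."""
    return ((5.0 - L * delta ** 2 / 6.0) * grad0
            / (4.0 * (1.0 - L * delta ** 2 / 6.0) * (1.0 - L * t ** 2 / 12.0)))


# Assumed example: L = 1, x0 = 1 (so |grad f(x0)| = 1), delta = 0.1.
L, x0, delta, h = 1.0, 1.0, 0.1, 1e-4
m_at_delta = velocity_ratio_sup(L, x0, delta, delta, h)
m_at_one = velocity_ratio_sup(L, x0, delta, 1.0, h)
print(m_at_delta, lemma15_bound(L, L * abs(x0), delta))
print(m_at_one, lemma17_bound(L, L * abs(x0), delta, 1.0))
```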


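Lemma 18 The function class $\mathcal{F} = \{X_\delta : [0, \sqrt{6/L}] \to \mathbb{R}^n \mid \delta = \sqrt{3/L}/2^m,\ m = 0, 1, \ldots\}$
is uniformly bounded and equicontinuous.

Proof By Lemmas 15 and 17, for any $t \in [0, \sqrt{6/L}]$, $\delta \in (0, \sqrt{3/L})$ the derivative $\dot X_\delta$ is uniformly
bounded as

$$\|\dot X_\delta(t)\| \le \sqrt{6/L}\, M_\delta(\sqrt{6/L}) \le \sqrt{6/L}\,\max\left\{\frac{\|\nabla f(x_0)\|}{1 - \frac{1}{2}},\ \frac{5\|\nabla f(x_0)\|}{4(1 - \frac{1}{2})(1 - \frac{1}{2})}\right\} = 5\sqrt{6/L}\,\|\nabla f(x_0)\|.$$

Thus it immediately implies that $\mathcal{F}$ is equicontinuous. To establish the uniform boundedness,
note that

$$\|X_\delta(t)\| \le \|X_\delta(0)\| + \int_0^t \|\dot X_\delta(u)\|\,du \le \|x_0\| + 30\|\nabla f(x_0)\|/L.$$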


We are now ready for the proof of Lemma 14. 

Proof By the Arzelà-Ascoli theorem and Lemma 18, $\mathcal{F}$ contains a subsequence converging
uniformly on $[0, \sqrt{6/L}]$. Denote by $\{X_{\delta_{m_i}}\}_{i \in \mathbb{N}}$ the convergent subsequence and $\breve X$ the limit.
Above, $\delta_{m_i} = \sqrt{3/L}/2^{m_i}$ decreases as $i$ increases. We will prove that $\breve X$ satisfies (3) and
the initial conditions $\breve X(0) = x_0$, $\dot{\breve X}(0) = 0$.

Fix an arbitrary $t_0 \in (0, \sqrt{6/L})$. Since $\|\dot X_{\delta_{m_i}}(t_0)\|$ is bounded, we can pick a subsequence
of $\dot X_{\delta_{m_i}}(t_0)$ which converges to a limit, denoted by $X^D_{t_0}$. Without loss of generality, assume
the subsequence is the original sequence. Denote by $\tilde X$ the local solution to (3) with $\tilde X(t_0) = \breve X(t_0)$
and $\dot{\tilde X}(t_0) = X^D_{t_0}$. Now recall that $X_{\delta_{m_i}}$ is the solution to (3) with $X(t_0) = X_{\delta_{m_i}}(t_0)$
and $\dot X(t_0) = \dot X_{\delta_{m_i}}(t_0)$ when $\delta_{m_i} < t_0$. Since $X_{\delta_{m_i}}(t_0)$ and $\dot X_{\delta_{m_i}}(t_0)$ approach $\breve X(t_0)$
and $X^D_{t_0}$, respectively, there exists $\epsilon_0 > 0$ such that

$$\sup_{t_0 - \epsilon_0 < t < t_0 + \epsilon_0} \|X_{\delta_{m_i}}(t) - \tilde X(t)\| \to 0$$

as $i \to \infty$. However, by definition we have

$$\sup_{t_0 - \epsilon_0 < t < t_0 + \epsilon_0} \|X_{\delta_{m_i}}(t) - \breve X(t)\| \to 0.$$

Therefore $\breve X$ and $\tilde X$ have to be identical on $(t_0 - \epsilon_0, t_0 + \epsilon_0)$. So $\breve X$ satisfies (3) at $t_0$. Since $t_0$
is arbitrary, we conclude that $\breve X$ is a solution to (3) on $(0, \sqrt{6/L})$. By extension, $\breve X$ can be
continued to a global solution to (3) on $(0, \infty)$. It only remains to verify the initial conditions to complete
the proof.

The first condition $\breve X(0) = x_0$ is a direct consequence of $X_{\delta_{m_i}}(0) = x_0$. To check the
second, pick a small $t > 0$ and note that

$$\frac{\|\breve X(t) - \breve X(0)\|}{t} = \lim_{i \to \infty}\frac{\|X_{\delta_{m_i}}(t) - X_{\delta_{m_i}}(0)\|}{t} = \lim_{i \to \infty}\|\dot X_{\delta_{m_i}}(\xi_i)\| \le \limsup_{i \to \infty}\, t M_{\delta_{m_i}}(t) \le 5t\sqrt{6/L}\,\|\nabla f(x_0)\|,$$

where $\xi_i \in (0, t)$ is given by the mean value theorem. The desired result follows from taking
$t \to 0$. ■


Next, we aim to prove the uniqueness of the solution to (3). 

Lemma 19 For any $f \in \mathcal{F}_\infty$, the ODE (3) has at most one local solution in a neighborhood
of $t = 0$.

Suppose on the contrary that there are two solutions, namely, $X$ and $Y$, both defined on
$(0, \alpha)$ for some $\alpha > 0$. Define $M(t)$ to be the supremum of $\|\dot X(u) - \dot Y(u)\|$ over $u \in [0, t)$.
To proceed, we need a simple auxiliary lemma.

Lemma 20 For any $t \in (0, \alpha)$, we have

$$\|\nabla f(X(t)) - \nabla f(Y(t))\| \le L t M(t).$$

Proof By Lipschitz continuity of the gradient, one has

$$\|\nabla f(X(t)) - \nabla f(Y(t))\| \le L\|X(t) - Y(t)\| = L\left\|\int_0^t \left(\dot X(u) - \dot Y(u)\right)du + X(0) - Y(0)\right\| \le L\int_0^t \|\dot X(u) - \dot Y(u)\|\,du \le L t M(t).$$

Now we prove Lemma 19.

Proof Similar to the proof of Lemma 17, we get

$$t^3\left(\dot X(t) - \dot Y(t)\right) = -\int_0^t u^3\left(\nabla f(X(u)) - \nabla f(Y(u))\right)du.$$

Applying Lemma 20 gives

$$t^3\|\dot X(t) - \dot Y(t)\| \le \int_0^t L u^4 M(u)\,du \le \frac{1}{5} L t^5 M(t),$$

which can be simplified as $\|\dot X(t) - \dot Y(t)\| \le L t^2 M(t)/5$. Thus, for any $t' \le t$ it is true
that $\|\dot X(t') - \dot Y(t')\| \le L t^2 M(t)/5$. Taking the supremum of $\|\dot X(t') - \dot Y(t')\|$ over $t' \in (0, t)$
gives $M(t) \le L t^2 M(t)/5$. Therefore $M(t) = 0$ for $t < \min(\alpha, \sqrt{5/L})$, which is equivalent
to saying $\dot X = \dot Y$ on $[0, \min(\alpha, \sqrt{5/L}))$. With the same initial value $X(0) = Y(0) = x_0$
and the same derivative, we conclude that $X$ and $Y$ are identical on $(0, \min(\alpha, \sqrt{5/L}))$, a
contradiction. ■

Given all of the aforementioned lemmas, Theorem 1 follows by combining Lemmas 14
and 19.

Appendix B. Proof of Theorem 2 

Identifying $\sqrt{s} = \Delta t$, the comparison between (4) and (15) reveals that Nesterov's scheme
is a discrete scheme for numerically integrating the ODE (3). However, the singularity of the
damping coefficient at $t = 0$ means that off-the-shelf ODE theory cannot be applied directly to prove
Theorem 2. To address this difficulty, we use the smoothed ODE (31) to approximate the
original one, and then bound the difference between Nesterov's scheme and the forward Euler
scheme of (31), which may take the following form:

$$\begin{aligned}
X^\delta_{k+1} &= X^\delta_k + \Delta t\, Z^\delta_k \\
Z^\delta_{k+1} &= \left(1 - \frac{3\Delta t}{\max\{\delta, k\Delta t\}}\right) Z^\delta_k - \Delta t\,\nabla f(X^\delta_k)
\end{aligned} \qquad (32)$$

with $X^\delta_0 = x_0$ and $Z^\delta_0 = 0$.

Lemma 21 With step size $\Delta t = \sqrt{s}$, for any $T > 0$ we have

$$\max_{1 \le k \le T/\sqrt{s}} \|X^\delta_k - x_k\| \le C\delta^2 + o_s(1)$$

for some constant $C$.

Proof Let $z_k = (x_{k+1} - x_k)/\sqrt{s}$. Then Nesterov's scheme is equivalent to

$$\begin{aligned}
x_{k+1} &= x_k + \sqrt{s}\, z_k \\
z_{k+1} &= \left(1 - \frac{3}{k+3}\right) z_k - \sqrt{s}\,\nabla f\!\left(x_k + \frac{2k+3}{k+3}\sqrt{s}\, z_k\right).
\end{aligned} \qquad (33)$$

Denote by $a_k = \|X^\delta_k - x_k\|$, $b_k = \|Z^\delta_k - z_k\|$, whose initial values are $a_0 = 0$ and $b_0 = \|\nabla f(x_0)\|\sqrt{s}$.
The idea of this proof is to bound $a_k$ via simultaneously estimating $a_k$ and
$b_k$. By comparing (32) and (33), we get the iterative relationship for $a_k$: $a_{k+1} \le a_k + \sqrt{s}\, b_k$.
Denoting by $S_k = b_0 + b_1 + \cdots + b_k$, this yields

$$a_k \le \sqrt{s}\, S_{k-1}. \qquad (34)$$


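Similarly, for sufficiently small $s$ we get

$$\begin{aligned}
b_{k+1} &\le \left|1 - \frac{3}{\max\{\delta/\sqrt{s}, k\}}\right| b_k + L\sqrt{s}\, a_k + \left(\left|\frac{3}{k+3} - \frac{3}{\max\{\delta/\sqrt{s}, k\}}\right| + 2Ls\right)\|z_k\| \\
&\le b_k + L\sqrt{s}\, a_k + \left(\left|\frac{3}{k+3} - \frac{3}{\max\{\delta/\sqrt{s}, k\}}\right| + 2Ls\right)\|z_k\|.
\end{aligned}$$

To upper bound $\|z_k\|$, denoting by $C_1$ the supremum of $\sqrt{2L(f(y_k) - f^\star)}$ over all $k$ and $s$,
we have

$$\|z_k\| \le \frac{k-1}{k+2}\|z_{k-1}\| + \sqrt{s}\,\|\nabla f(y_k)\| \le \|z_{k-1}\| + C_1\sqrt{s},$$

which gives $\|z_k\| \le C_1(k+1)\sqrt{s}$. Hence,

$$\left(\left|\frac{3}{k+3} - \frac{3}{\max\{\delta/\sqrt{s}, k\}}\right| + 2Ls\right)\|z_k\| \le \begin{cases} C_2\sqrt{s}, & k \le \delta/\sqrt{s} \\ \dfrac{C_2\sqrt{s}}{k} < \dfrac{C_2 s}{\delta}, & k > \delta/\sqrt{s}. \end{cases}$$

Making use of (34) gives

$$b_{k+1} \le \begin{cases} b_k + L s S_{k-1} + C_2\sqrt{s}, & k \le \delta/\sqrt{s} \\ b_k + L s S_{k-1} + \dfrac{C_2 s}{\delta}, & k > \delta/\sqrt{s}. \end{cases} \qquad (35)$$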


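By induction on $k$, for $k \le \delta/\sqrt{s}$ it holds that

$$b_k \le \frac{C_1 L s + C_2 + (C_1 + C_2)\sqrt{Ls}}{2\sqrt{L}}\left(1 + \sqrt{Ls}\right)^{k-1} - \frac{C_1 L s + C_2 - (C_1 + C_2)\sqrt{Ls}}{2\sqrt{L}}\left(1 - \sqrt{Ls}\right)^{k-1}.$$

Hence,

$$S_k \le \frac{C_1 L s + C_2 + (C_1 + C_2)\sqrt{Ls}}{2L\sqrt{s}}\left(1 + \sqrt{Ls}\right)^{k} + \frac{C_1 L s + C_2 - (C_1 + C_2)\sqrt{Ls}}{2L\sqrt{s}}\left(1 - \sqrt{Ls}\right)^{k} - \frac{C_2}{L\sqrt{s}}.$$

Letting $k^\star = \lfloor \delta/\sqrt{s} \rfloor$, we get

$$\limsup_{s \to 0} \sqrt{s}\, S_{k^\star - 1} \le \frac{C_2 e^{\delta\sqrt{L}} + C_2 e^{-\delta\sqrt{L}} - 2C_2}{2L} = O(\delta^2),$$

which allows us to conclude that

$$a_k \le \sqrt{s}\, S_{k-1} = O(\delta^2) + o_s(1) \qquad (36)$$

for all $k \le \delta/\sqrt{s}$.

Next, we bound $b_k$ for $k > k^\star = \lfloor \delta/\sqrt{s} \rfloor$. To this end, we consider the worst case of
(35), that is,

$$b_{k+1} = b_k + L s S_{k-1} + \frac{C_2 s}{\delta}$$

for $k > k^\star$ and $S_{k^\star} = S_{k^\star + 1} = C_3\delta^2/\sqrt{s} + o_s(1/\sqrt{s})$ for some sufficiently large $C_3$. In this
case, $C_2 s/\delta < s S_{k-1}$ for sufficiently small $s$. Hence, the last display gives

$$b_{k+1} \le b_k + (L + 1) s S_{k-1}.$$

By induction, we get

$$S_k \le \frac{C_3\delta^2/\sqrt{s} + o_s(1/\sqrt{s})}{2}\left(\left(1 + \sqrt{(L+1)s}\right)^{k - k^\star} + \left(1 - \sqrt{(L+1)s}\right)^{k - k^\star}\right).$$

Letting $k^\diamond = \lfloor T/\sqrt{s} \rfloor$, we further get

$$\limsup_{s \to 0} \sqrt{s}\, S_{k^\diamond} \le \frac{C_3\delta^2\left(e^{T\sqrt{L+1}} + e^{-T\sqrt{L+1}}\right)}{2} = O(\delta^2),$$

which yields

$$a_k \le \sqrt{s}\, S_{k-1} = O(\delta^2) + o_s(1)$$

for $k^\star < k \le k^\diamond$. Last, combining (36) and the last display, we get the desired result.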
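The closed-form expression used in the first induction step can be cross-checked numerically. The sketch below iterates the worst case of the first branch of (35), $b_{k+1} = b_k + LsS_{k-1} + C_2\sqrt{s}$, started at $b_0 = C_1\sqrt{s}$ with the convention $S_{-1} = 0$, and compares the partial sums with the closed form obtained by solving the linear recursion (characteristic roots $1 \pm \sqrt{Ls}$). The values of `L`, `s`, `C1`, `C2` are arbitrary numbers chosen for illustration.

```python
import math


def sums_by_recursion(L, s, C1, C2, n):
    """Partial sums S_0..S_n of b_k, where b_0 = C1*sqrt(s) and
    b_{k+1} = b_k + L*s*S_{k-1} + C2*sqrt(s), with the convention S_{-1} = 0."""
    b = C1 * math.sqrt(s)
    S_prev, S = 0.0, b                      # S_{-1}, S_0
    sums = [S]
    for _ in range(n):
        b = b + L * s * S_prev + C2 * math.sqrt(s)   # b_{k+1}
        S_prev, S = S, S + b
        sums.append(S)
    return sums


def sums_closed_form(L, s, C1, C2, k):
    """Closed-form value of S_k for the recursion above."""
    eps = math.sqrt(L * s)
    base = C1 * L * s + C2
    lead = (C1 + C2) * eps
    denom = 2.0 * L * math.sqrt(s)
    return ((base + lead) / denom * (1.0 + eps) ** k
            + (base - lead) / denom * (1.0 - eps) ** k
            - C2 / (L * math.sqrt(s)))


# Assumed constants for illustration only.
L, s, C1, C2 = 2.0, 1e-4, 1.0, 3.0
recursive = sums_by_recursion(L, s, C1, C2, 200)
worst_gap = max(abs(recursive[k] - sums_closed_form(L, s, C1, C2, k)) for k in range(201))
print("largest discrepancy:", worst_gap)
```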


Now we turn to the proof of Theorem 2. 

Proof Note the triangle inequality

$$\|x_k - X(k\sqrt{s})\| \le \|x_k - X^\delta_k\| + \|X^\delta_k - X_\delta(k\sqrt{s})\| + \|X_\delta(k\sqrt{s}) - X(k\sqrt{s})\|,$$

where $X_\delta(\cdot)$ is the solution to the smoothed ODE (31). The proof of Lemma 14 implies
that we can choose a sequence $\delta_m \to 0$ such that

$$\sup_{0 \le t \le T}\|X_{\delta_m}(t) - X(t)\| \to 0.$$

The second term $\|X^{\delta_m}_k - X_{\delta_m}(k\sqrt{s})\|$ will uniformly vanish as $s \to 0$ and so does the first
term $\|x_k - X^{\delta_m}_k\|$ if first $s \to 0$ and then $\delta_m \to 0$. This completes the proof. ■
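The objects compared in this appendix can be simulated side by side. The sketch below implements Nesterov's scheme, its velocity form (33), and the forward Euler scheme (32), and reports how far apart the iterates are. The scalar objective $f(x) = x^2/2$ and the values of $s$, $\delta$ and $T$ are assumptions for illustration; the function names are hypothetical.

```python
import math


def nesterov(grad_f, x0, s, n):
    """Nesterov's scheme: x_k = y_{k-1} - s*grad f(y_{k-1}),
    y_k = x_k + (k-1)/(k+2) * (x_k - x_{k-1}), with y_0 = x_0."""
    xs, y = [x0], x0
    for k in range(1, n + 1):
        x = y - s * grad_f(y)
        y = x + (k - 1) / (k + 2) * (x - xs[-1])
        xs.append(x)
    return xs


def nesterov_velocity_form(grad_f, x0, s, n):
    """Recursion (33) in the variables x_k and z_k = (x_{k+1} - x_k)/sqrt(s)."""
    rs = math.sqrt(s)
    x, z = x0, -rs * grad_f(x0)          # z_0 = (x_1 - x_0)/sqrt(s)
    xs = [x]
    for k in range(n):
        x_next = x + rs * z
        z = (1 - 3 / (k + 3)) * z - rs * grad_f(x + (2 * k + 3) / (k + 3) * rs * z)
        x = x_next
        xs.append(x)
    return xs


def euler_smoothed(grad_f, x0, s, delta, n):
    """Forward Euler scheme (32) for the smoothed ODE with step sqrt(s)."""
    dt = math.sqrt(s)
    x, z = x0, 0.0
    xs = [x]
    for k in range(n):
        x_next = x + dt * z
        z = (1 - 3 * dt / max(delta, k * dt)) * z - dt * grad_f(x)
        x = x_next
        xs.append(x)
    return xs


grad = lambda x: x                       # assumed objective f(x) = x^2 / 2
s, delta, T = 1e-4, 0.05, 2.0
n = int(round(T / math.sqrt(s)))
xs = nesterov(grad, 1.0, s, n)
vs = nesterov_velocity_form(grad, 1.0, s, n)
es = euler_smoothed(grad, 1.0, s, delta, n)
gap_velocity = max(abs(a - b) for a, b in zip(xs, vs))
gap_euler = max(abs(a - b) for a, b in zip(xs, es))
print("max |x_k - velocity form|:", gap_velocity)
print("max |x_k - X_k^delta|:", gap_euler)
```

The first gap should be at the level of rounding error, since (33) is only a rewriting of the scheme; the second is the quantity that Lemma 21 controls.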


Appendix C. ODE for Composite Optimization 

In analogy to (3) for smooth $f$ in Section 2, we develop an ODE for composite optimization,

$$\text{minimize} \quad f(x) = g(x) + h(x), \qquad (37)$$

where $g \in \mathcal{F}_L$ and $h$ is a general convex function possibly taking on the value $+\infty$. Provided
it is easy to evaluate the proximal of $h$, Beck and Teboulle (2009) propose a proximal
gradient version of Nesterov's scheme for solving (37). It is to repeat the following recursion
for $k \ge 1$,

$$\begin{aligned}
x_k &= y_{k-1} - s G_s(y_{k-1}) \\
y_k &= x_k + \frac{k-1}{k+2}(x_k - x_{k-1}),
\end{aligned}$$

where the proximal subgradient $G_s$ has been defined in Section 4.1. If the constant step
size $s \le 1/L$, it is guaranteed that (Beck and Teboulle, 2009)

$$f(x_k) - f^\star \le \frac{2\|x_0 - x^\star\|^2}{s(k+1)^2},$$

which in fact is a special case of Theorem 6.
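A minimal sketch of the recursion above, assuming a separable test problem $g(x) = \sum_i d_i (x_i - b_i)^2/2$ and $h(x) = \lambda\|x\|_1$ so that the minimizer is available in closed form. The problem data, the helper names, and the choice $s = 1/L$ with $L = \max_i d_i$ are assumptions made for illustration; the update $y - sG_s(y)$ is computed as the proximal step $\mathrm{prox}_{sh}(y - s\nabla g(y))$.

```python
def soft_threshold(v, tau):
    """Proximal operator of tau*|.| applied to a scalar."""
    if v > tau:
        return v - tau
    if v < -tau:
        return v + tau
    return 0.0


def objective(x, d, b, lam):
    """f(x) = sum_i d_i (x_i - b_i)^2 / 2 + lam * ||x||_1 (assumed test problem)."""
    return sum(0.5 * di * (xi - bi) ** 2 + lam * abs(xi) for xi, di, bi in zip(x, d, b))


def minimizer(d, b, lam):
    """Coordinate-wise closed-form minimizer of the separable test problem."""
    return [soft_threshold(bi, lam / di) for di, bi in zip(d, b)]


def accelerated_proximal_gradient(x0, d, b, lam, s, n_iter):
    """x_k = y_{k-1} - s G_s(y_{k-1}),  y_k = x_k + (k-1)/(k+2) (x_k - x_{k-1})."""
    x_prev, y = list(x0), list(x0)
    iterates = [list(x0)]
    for k in range(1, n_iter + 1):
        # y - s*G_s(y) equals the proximal step prox_{s h}(y - s*grad g(y)).
        x = [soft_threshold(yi - s * di * (yi - bi), s * lam) for yi, di, bi in zip(y, d, b)]
        y = [xi + (k - 1) / (k + 2) * (xi - xpi) for xi, xpi in zip(x, x_prev)]
        x_prev = x
        iterates.append(x)
    return iterates


d, b, lam = [1.0, 4.0, 0.25], [3.0, -0.2, 2.0], 1.0
s = 1.0 / max(d)                          # step size s = 1/L with L = max_i d_i
x0 = [5.0, 5.0, 5.0]
x_star = minimizer(d, b, lam)
f_star = objective(x_star, d, b, lam)
iterates = accelerated_proximal_gradient(x0, d, b, lam, s, 100)
dist2 = sum((u - v) ** 2 for u, v in zip(x0, x_star))
gaps = [objective(x, d, b, lam) - f_star for x in iterates]
print("final gap:", gaps[-1], "bound:", 2 * dist2 / (s * 101 ** 2))
```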

Compared to the smooth case, it is less clear how to define the driving force that plays the role of $\nabla f$ in (3).
At first, it might be a good try to define

$$G(x) = \lim_{s \to 0} G_s(x) = \lim_{s \to 0}\frac{x - \operatorname{argmin}_z\left(\|z - (x - s\nabla g(x))\|^2/(2s) + h(z)\right)}{s},$$

if it exists. However, as implied in the proof of Theorem 24 stated below, this definition fails
to capture the directional aspect of the subgradient. To this end, we define the subgradients
through the following lemma.


Lemma 22 (Rockafellar, 1997) For any convex function $f$ and any $x, p \in \mathbb{R}^n$, the directional
derivative $\lim_{s \to 0+}(f(x + sp) - f(x))/s$ exists, and can be evaluated as

$$\lim_{s \to 0+}\frac{f(x + sp) - f(x)}{s} = \sup_{\xi \in \partial f(x)}\langle \xi, p\rangle.$$

Note that the directional derivative is semilinear in $p$ because

$$\sup_{\xi \in \partial f(x)}\langle \xi, cp\rangle = c\sup_{\xi \in \partial f(x)}\langle \xi, p\rangle$$

for any $c > 0$.
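As a small worked instance of Lemma 22, take $f(x) = \|x\|_1$, whose subdifferential is $\{\mathrm{sgn}(x_i)\}$ in coordinates with $x_i \neq 0$ and $[-1, 1]$ in coordinates with $x_i = 0$. The sketch below evaluates the supremum in closed form and compares it with a one-sided difference quotient; the test point, the direction and the function names are assumptions for this example.

```python
def l1_norm(x):
    return sum(abs(xi) for xi in x)


def sup_over_subdifferential(x, p):
    """sup of <xi, p> over xi in the subdifferential of ||.||_1 at x."""
    total = 0.0
    for xi, pi in zip(x, p):
        if xi > 0:
            total += pi
        elif xi < 0:
            total -= pi
        else:
            total += abs(pi)       # this coordinate of xi ranges over [-1, 1]
    return total


def one_sided_difference(f, x, p, s):
    """(f(x + s p) - f(x)) / s for a small s > 0."""
    return (f([xi + s * pi for xi, pi in zip(x, p)]) - f(x)) / s


x = [1.0, 0.0, -2.0, 0.0]
p = [0.5, -1.0, 3.0, 0.0]
exact = sup_over_subdifferential(x, p)
approx = one_sided_difference(l1_norm, x, p, 1e-3)
print(exact, approx)
```

The same functions also show why the map is only semilinear: scaling `p` by a positive constant scales the value, but replacing `p` with its negative does not simply flip the sign when some `x_i` is zero.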


Definition 23 A Borel measurable function $G(x, p; f)$ defined on $\mathbb{R}^n \times \mathbb{R}^n$ is said to be a
directional subgradient of $f$ if

$$G(x, p) \in \partial f(x), \qquad \langle G(x, p), p\rangle = \sup_{\xi \in \partial f(x)}\langle \xi, p\rangle$$

for all $x, p$.

Convex functions are naturally locally Lipschitz, so $\partial f(x)$ is compact for any $x$. Consequently
there exists $\xi \in \partial f(x)$ which maximizes $\langle \xi, p\rangle$. So Lemma 22 guarantees the
existence of a directional subgradient. The function $G$ is essentially a function defined on
$\mathbb{R}^n \times S^{n-1}$ in that we can define

$$G(x, p) = G(x, p/\|p\|),$$

and $G(x, 0)$ to be any element in $\partial f(x)$. Now we give the main theorem. However, note
that we do not guarantee the existence of a solution to (38).

Theorem 24 Given a convex function $f(x)$ with directional subgradient $G(x, p; f)$, assume
that the second order ODE

$$\ddot X + \frac{3}{t}\dot X + G(X, \dot X) = 0, \qquad X(0) = x_0,\ \dot X(0) = 0 \qquad (38)$$

admits a solution $X(t)$ on $[0, \alpha)$ for some $\alpha > 0$. Then for any $0 < t < \alpha$, we have

$$f(X(t)) - f^\star \le \frac{2\|x_0 - x^\star\|^2}{t^2}.$$

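Proof It suffices to establish that $\mathcal{E}$, first defined in the proof of Theorem 3, is monotonically
decreasing. The difficulty is that $\mathcal{E}$ may not be differentiable in this setting.
Instead, we study $(\mathcal{E}(t + \Delta t) - \mathcal{E}(t))/\Delta t$ for small $\Delta t > 0$. In $\mathcal{E}$, the second term $2\|X + t\dot X/2 - x^\star\|^2$
is differentiable, with derivative $4\langle X + \frac{t}{2}\dot X - x^\star, \frac{3}{2}\dot X + \frac{t}{2}\ddot X\rangle$. Hence,

$$\begin{aligned}
&2\left\|X(t + \Delta t) + \frac{t + \Delta t}{2}\dot X(t + \Delta t) - x^\star\right\|^2 - 2\left\|X(t) + \frac{t}{2}\dot X(t) - x^\star\right\|^2 \\
&\quad = 4\left\langle X + \frac{t}{2}\dot X - x^\star,\ \frac{3}{2}\dot X + \frac{t}{2}\ddot X\right\rangle\Delta t + o(\Delta t) \\
&\quad = -t^2\langle \dot X, G(X, \dot X)\rangle\Delta t - 2t\langle X - x^\star, G(X, \dot X)\rangle\Delta t + o(\Delta t).
\end{aligned} \qquad (39)$$

For the first term, note that

$$(t + \Delta t)^2\left(f(X(t + \Delta t)) - f^\star\right) - t^2\left(f(X(t)) - f^\star\right) = 2t\left(f(X(t + \Delta t)) - f^\star\right)\Delta t + t^2\left(f(X(t + \Delta t)) - f(X(t))\right) + o(\Delta t).$$

Since $f$ is locally Lipschitz, the $o(\Delta t)$ term does not affect the function in the limit,

$$f(X(t + \Delta t)) = f(X + \Delta t\,\dot X + o(\Delta t)) = f(X + \Delta t\,\dot X) + o(\Delta t). \qquad (40)$$

By Lemma 22, we have the approximation

$$f(X + \Delta t\,\dot X) = f(X) + \langle \dot X, G(X, \dot X)\rangle\Delta t + o(\Delta t). \qquad (41)$$

Combining all of (39), (40) and (41), we obtain

$$\begin{aligned}
\mathcal{E}(t + \Delta t) - \mathcal{E}(t) &= 2t\left(f(X(t + \Delta t)) - f^\star\right)\Delta t + t^2\langle \dot X, G(X, \dot X)\rangle\Delta t - t^2\langle \dot X, G(X, \dot X)\rangle\Delta t \\
&\quad - 2t\langle X - x^\star, G(X, \dot X)\rangle\Delta t + o(\Delta t) \\
&= 2t\left(f(X) - f^\star\right)\Delta t - 2t\langle X - x^\star, G(X, \dot X)\rangle\Delta t + o(\Delta t) \le o(\Delta t),
\end{aligned}$$

where the last inequality follows from the convexity of $f$. Thus,

$$\limsup_{\Delta t \to 0+}\frac{\mathcal{E}(t + \Delta t) - \mathcal{E}(t)}{\Delta t} \le 0,$$

which along with the continuity of $\mathcal{E}$, concludes that $\mathcal{E}(t)$ is a non-increasing function of $t$.

We give a simple example as follows. Consider the Lasso problem

$$\text{minimize} \quad \frac{1}{2}\|y - Ax\|^2 + \lambda\|x\|_1.$$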


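Any directional subgradient admits the form $G(x, p) = -A^T(y - Ax) + \lambda\,\mathrm{sgn}(x, p)$, where

$$\mathrm{sgn}(x, p)_i \begin{cases} = \mathrm{sgn}(x_i), & x_i \neq 0 \\ = \mathrm{sgn}(p_i), & x_i = 0,\ p_i \neq 0 \\ \in [-1, 1], & x_i = 0,\ p_i = 0. \end{cases}$$

To encourage sparsity, for any index $i$ with $x_i = 0$, $p_i = 0$, we let

$$G(x, p)_i = \mathrm{sgn}\left(A_i^T(Ax - y)\right)\left(|A_i^T(Ax - y)| - \lambda\right)_+.$$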
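The Lasso directional subgradient described above can be written out directly. The following sketch uses plain Python lists, with `A` stored as a list of rows; the small matrix, the vectors and the function names are hypothetical test data, and the finite-difference step is only used to compare $\langle G(x, p), p\rangle$ with the directional derivative of Lemma 22.

```python
def sign(v):
    return (v > 0) - (v < 0)


def lasso_objective(A, y, lam, x):
    """f(x) = ||y - A x||^2 / 2 + lam * ||x||_1."""
    residual = [sum(a * xj for a, xj in zip(row, x)) - yi for row, yi in zip(A, y)]
    return 0.5 * sum(r * r for r in residual) + lam * sum(abs(xj) for xj in x)


def lasso_directional_subgradient(A, y, lam, x, p):
    """G(x, p) for the Lasso objective, with the sparsity-encouraging choice
    in coordinates where both x_i and p_i vanish."""
    m, n = len(A), len(x)
    residual = [sum(A[i][j] * x[j] for j in range(n)) - y[i] for i in range(m)]   # A x - y
    grad = [sum(A[i][j] * residual[i] for i in range(m)) for j in range(n)]       # A^T (A x - y)
    G = []
    for j in range(n):
        if x[j] != 0:
            G.append(grad[j] + lam * sign(x[j]))
        elif p[j] != 0:
            G.append(grad[j] + lam * sign(p[j]))
        else:
            # Soft-threshold the smooth gradient so the coordinate stays at zero when possible.
            G.append(sign(grad[j]) * max(abs(grad[j]) - lam, 0.0))
    return G


A = [[1.0, 2.0, 0.0], [0.0, 1.0, 1.0]]
y = [1.0, -1.0]
lam = 0.5
x = [1.0, 0.0, 0.0]
p = [0.3, -1.0, 0.0]
G = lasso_directional_subgradient(A, y, lam, x, p)
inner = sum(g * pj for g, pj in zip(G, p))
step = 1e-6
shifted = [xj + step * pj for xj, pj in zip(x, p)]
quotient = (lasso_objective(A, y, lam, shifted) - lasso_objective(A, y, lam, x)) / step
print(G, inner, quotient)
```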


Appendix D. Proof of Theorem 9 

Proof Let $g$ be $\mu$-strongly convex and $h$ be convex. For $f = g + h$, we show that (22) can
be strengthened to

$$f(y - sG_s(y)) \le f(x) + G_s(y)^T(y - x) - \frac{s}{2}\|G_s(y)\|^2 - \frac{\mu}{2}\|y - x\|^2. \qquad (42)$$

Summing $(4k - 3) \times$ (42) with $x = x_{k-1}$, $y = y_{k-1}$ and $(4r - 6) \times$ (42) with $x = x^\star$, $y = y_{k-1}$
yields

$$\begin{aligned}
(4k + 4r - 9) f(x_k) \le{}& (4k - 3) f(x_{k-1}) + (4r - 6) f^\star \\
&+ G_s(y_{k-1})^T\left[(4k + 4r - 9) y_{k-1} - (4k - 3) x_{k-1} - (4r - 6) x^\star\right] \\
&- \frac{s(4k + 4r - 9)}{2}\|G_s(y_{k-1})\|^2 - \frac{\mu(4k - 3)}{2}\|y_{k-1} - x_{k-1}\|^2 - \mu(2r - 3)\|y_{k-1} - x^\star\|^2 \\
\le{}& (4k - 3) f(x_{k-1}) + (4r - 6) f^\star - \mu(2r - 3)\|y_{k-1} - x^\star\|^2 \\
&+ G_s(y_{k-1})^T\left[(4k + 4r - 9)(y_{k-1} - x^\star) - (4k - 3)(x_{k-1} - x^\star)\right],
\end{aligned} \qquad (43)$$

which gives a lower bound on $G_s(y_{k-1})^T\left[(4k + 4r - 9) y_{k-1} - (4k - 3) x_{k-1} - (4r - 6) x^\star\right]$.
Denote by $\Delta_k$ the second term of $\mathcal{E}(k)$ in (28), namely,

$$\Delta_k = \frac{k + d}{8}\|(2k + 2r - 2)(y_k - x^\star) - (2k + 1)(x_k - x^\star)\|^2,$$

where $d := 3r/2 - 5/2$. Then by (43), we get

$$\begin{aligned}
\Delta_k - \Delta_{k-1} ={}& -\frac{k + d}{8}\left\langle s(2r + 2k - 5) G_s(y_{k-1}) + \frac{k - 2}{k + r - 2}(x_{k-1} - x_{k-2}),\ (4k + 4r - 9)(y_{k-1} - x^\star) - (4k - 3)(x_{k-1} - x^\star)\right\rangle \\
&+ \frac{1}{8}\|(2k + 2r - 4)(y_{k-1} - x^\star) - (2k - 1)(x_{k-1} - x^\star)\|^2 \\
\le{}& -\frac{s(k + d)(2k + 2r - 5)}{8}\left[(4k + 4r - 9)(f(x_k) - f^\star) - (4k - 3)(f(x_{k-1}) - f^\star) + \mu(2r - 3)\|y_{k-1} - x^\star\|^2\right] \\
&- \frac{(k + d)(k - 2)}{8(k + r - 2)}\left\langle x_{k-1} - x_{k-2},\ (4k + 4r - 9)(y_{k-1} - x^\star) - (4k - 3)(x_{k-1} - x^\star)\right\rangle \\
&+ \frac{1}{8}\|2(k + r - 2)(y_{k-1} - x^\star) - (2k - 1)(x_{k-1} - x^\star)\|^2.
\end{aligned}$$


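Hence,

$$\begin{aligned}
\Delta_k + \frac{s(k + d)(2k + 2r - 5)(4k + 4r - 9)}{8}(f(x_k) - f^\star) \le{}& \Delta_{k-1} + \frac{s(k + d)(2k + 2r - 5)(4k - 3)}{8}(f(x_{k-1}) - f^\star) \\
&- \frac{s\mu(2r - 3)(k + d)(2k + 2r - 5)}{8}\|y_{k-1} - x^\star\|^2 + \Pi_1 + \Pi_2,
\end{aligned} \qquad (44)$$

where

$$\begin{aligned}
\Pi_1 &:= -\frac{(k + d)(k - 2)}{8(k + r - 2)}\left\langle x_{k-1} - x_{k-2},\ (4k + 4r - 9)(y_{k-1} - x^\star) - (4k - 3)(x_{k-1} - x^\star)\right\rangle, \\
\Pi_2 &:= \frac{1}{8}\|2(k + r - 2)(y_{k-1} - x^\star) - (2k - 1)(x_{k-1} - x^\star)\|^2.
\end{aligned}$$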

By the iterations defined in (19), one can show that

$$\begin{aligned}
\Pi_1 ={}& -\frac{(2r - 3)(k + d)(k - 2)}{8(k + r - 2)}\left(\|x_{k-1} - x^\star\|^2 - \|x_{k-2} - x^\star\|^2\right) \\
&- \frac{(k - 2)^2(4k + 4r - 9)(k + d) + (2r - 3)(k - 2)(k + r - 2)(k + d)}{8(k + r - 2)^2}\|x_{k-1} - x_{k-2}\|^2, \\
\Pi_2 ={}& \frac{(2r - 3)^2}{8}\|y_{k-1} - x^\star\|^2 + \frac{(2r - 3)(2k - 1)(k - 2)}{8(k + r - 2)}\left(\|x_{k-1} - x^\star\|^2 - \|x_{k-2} - x^\star\|^2\right) \\
&+ \frac{(k - 2)^2(2k - 1)(2k + 4r - 7) + (2r - 3)(2k - 1)(k - 2)(k + r - 2)}{8(k + r - 2)^2}\|x_{k-1} - x_{k-2}\|^2.
\end{aligned}$$

Although this is a little tedious, it is straightforward to check that $(k - 2)^2(4k + 4r - 9)(k + d) + (2r - 3)(k - 2)(k + r - 2)(k + d) \ge (k - 2)^2(2k - 1)(2k + 4r - 7) + (2r - 3)(2k - 1)(k - 2)(k + r - 2)$
for any $k$. Therefore, $\Pi_1 + \Pi_2$ is bounded as

$$\Pi_1 + \Pi_2 \le \frac{(2r - 3)^2}{8}\|y_{k-1} - x^\star\|^2 + \frac{(2r - 3)(k - d - 1)(k - 2)}{8(k + r - 2)}\left(\|x_{k-1} - x^\star\|^2 - \|x_{k-2} - x^\star\|^2\right),$$

which, together with the fact that $s\mu(2r - 3)(k + d)(2k + 2r - 5) \ge (2r - 3)^2$ when $k \ge \sqrt{(2r - 3)/(2s\mu)}$, reduces (44) to

$$\begin{aligned}
\Delta_k + \frac{s(k + d)(2k + 2r - 5)(4k + 4r - 9)}{8}(f(x_k) - f^\star) \le{}& \Delta_{k-1} + \frac{s(k + d)(2k + 2r - 5)(4k - 3)}{8}(f(x_{k-1}) - f^\star) \\
&+ \frac{(2r - 3)(k - d - 1)(k - 2)}{8(k + r - 2)}\left(\|x_{k-1} - x^\star\|^2 - \|x_{k-2} - x^\star\|^2\right).
\end{aligned}$$
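The polynomial inequality used above to bound $\Pi_1 + \Pi_2$ can be spot-checked with exact rational arithmetic. The sketch below evaluates both sides on a grid of $k$ and of $r \ge 9/2$ (the range used in Theorem 9), and also checks the sign of the cubic-in-$r$ polynomial $(8r - 36)k^2 + (20r^2 - 126r + 200)k + 12r^3 - 100r^2 + 288r - 281$ that appears in the next step. The grid and the function names are arbitrary choices for illustration, and a finite grid is of course not a proof.

```python
from fractions import Fraction


def pi1_coefficient(k, r):
    """Numerator of the ||x_{k-1} - x_{k-2}||^2 coefficient in Pi_1 (sign dropped)."""
    d = Fraction(3, 2) * r - Fraction(5, 2)
    return ((k - 2) ** 2 * (4 * k + 4 * r - 9) * (k + d)
            + (2 * r - 3) * (k - 2) * (k + r - 2) * (k + d))


def pi2_coefficient(k, r):
    """Numerator of the ||x_{k-1} - x_{k-2}||^2 coefficient in Pi_2."""
    return ((k - 2) ** 2 * (2 * k - 1) * (2 * k + 4 * r - 7)
            + (2 * r - 3) * (2 * k - 1) * (k - 2) * (k + r - 2))


def a_coefficient(k, r):
    """Polynomial whose positivity is used for the coefficient A_k in (45)."""
    return ((8 * r - 36) * k ** 2 + (20 * r ** 2 - 126 * r + 200) * k
            + 12 * r ** 3 - 100 * r ** 2 + 288 * r - 281)


r_values = [Fraction(9, 2), Fraction(5), Fraction(6), Fraction(15, 2), Fraction(10)]
k_values = range(2, 501)
ok_pi = all(pi1_coefficient(k, r) >= pi2_coefficient(k, r) for r in r_values for k in k_values)
ok_a = all(a_coefficient(k, r) > 0 for r in r_values for k in k_values)
print(ok_pi, ok_a)
```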


This can be further simplified as

$$\mathcal{E}(k) + A_k(f(x_{k-1}) - f^\star) \le \mathcal{E}(k - 1) + B_k\left(\|x_{k-1} - x^\star\|^2 - \|x_{k-2} - x^\star\|^2\right) \qquad (45)$$

for $k \ge \sqrt{(2r - 3)/(2s\mu)}$, where $A_k = (8r - 36)k^2 + (20r^2 - 126r + 200)k + 12r^3 - 100r^2 + 288r - 281 > 0$
since $r \ge 9/2$ and $B_k = (2r - 3)(k - d - 1)(k - 2)/(8(k + r - 2))$. Denote
by $k^\star = \lceil \max\{\sqrt{(2r - 3)/(2s\mu)},\ 3r/2 - 3/2\} \rceil \asymp 1/\sqrt{s\mu}$. Then $B_k$ is a positive increasing
sequence if $k > k^\star$. Summing (45) from $k$ to $k^\star + 1$, we obtain

$$\begin{aligned}
\mathcal{E}(k) + \sum_{i = k^\star + 1}^{k} A_i(f(x_{i-1}) - f^\star) &\le \mathcal{E}(k^\star) + \sum_{i = k^\star + 1}^{k} B_i\left(\|x_{i-1} - x^\star\|^2 - \|x_{i-2} - x^\star\|^2\right) \\
&= \mathcal{E}(k^\star) + B_k\|x_{k-1} - x^\star\|^2 - B_{k^\star + 1}\|x_{k^\star - 1} - x^\star\|^2 + \sum_{j = k^\star + 1}^{k - 1}(B_j - B_{j+1})\|x_{j-1} - x^\star\|^2 \\
&\le \mathcal{E}(k^\star) + B_k\|x_{k-1} - x^\star\|^2.
\end{aligned}$$


Similarly, as in the proof of Theorem 8, we can bound $\mathcal{E}(k^\star)$ via another energy functional
defined from Theorem 5,

$$\begin{aligned}
\mathcal{E}(k^\star) \le{}& s(2k^\star + 3r - 5)(k^\star + r - 2)^2(f(x_{k^\star}) - f^\star) \\
&+ \frac{2k^\star + 3r - 5}{16}\|2(k^\star + r - 1)y_{k^\star} - 2k^\star x_{k^\star} - 2(r - 1)x^\star - (x_{k^\star} - x^\star)\|^2 \\
\le{}& s(2k^\star + 3r - 5)(k^\star + r - 2)^2(f(x_{k^\star}) - f^\star) \\
&+ \frac{2k^\star + 3r - 5}{8}\|2(k^\star + r - 1)y_{k^\star} - 2k^\star x_{k^\star} - 2(r - 1)x^\star\|^2 + \frac{2k^\star + 3r - 5}{8}\|x_{k^\star} - x^\star\|^2 \\
\le{}& \frac{(r - 1)^2(2k^\star + 3r - 5)}{2}\|x_0 - x^\star\|^2 + \frac{(r - 1)^2(2k^\star + 3r - 5)}{8s\mu(k^\star + r - 2)^2}\|x_0 - x^\star\|^2 \lesssim \frac{\|x_0 - x^\star\|^2}{\sqrt{s\mu}}.
\end{aligned} \qquad (46)$$

For the second term, it follows from Theorem 6 that

$$\begin{aligned}
B_k\|x_{k-1} - x^\star\|^2 &\le \frac{(2r - 3)(2k - 3r + 3)(k - 2)}{8\mu(k + r - 2)}(f(x_{k-1}) - f^\star) \\
&\le \frac{(2r - 3)(2k - 3r + 3)(k - 2)}{8\mu(k + r - 2)} \cdot \frac{(r - 1)^2\|x_0 - x^\star\|^2}{2s(k + r - 3)^2} \\
&\le \frac{(2r - 3)(r - 1)^2(2k^\star - 3r + 3)(k^\star - 2)}{16s\mu(k^\star + r - 2)(k^\star + r - 3)^2}\|x_0 - x^\star\|^2 \lesssim \frac{\|x_0 - x^\star\|^2}{\sqrt{s\mu}}.
\end{aligned} \qquad (47)$$


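For $k > k^\star$, (46) together with (47) gives

$$\begin{aligned}
f(x_k) - f^\star &\le \frac{16\,\mathcal{E}(k)}{s(2k + 3r - 5)(2k + 2r - 5)(4k + 4r - 9)} \\
&\le \frac{16\left(\mathcal{E}(k^\star) + B_k\|x_{k-1} - x^\star\|^2\right)}{s(2k + 3r - 5)(2k + 2r - 5)(4k + 4r - 9)} \lesssim \frac{\|x_0 - x^\star\|^2}{s^{3/2}\mu^{1/2}k^3}.
\end{aligned}$$

To conclude, note that by Theorem 6 the gap $f(x_k) - f^\star$ for $k \le k^\star$ is bounded by

$$\frac{(r - 1)^2\|x_0 - x^\star\|^2}{2s(k + r - 2)^2} = \frac{(r - 1)^2\sqrt{s\mu}\,k^3}{2(k + r - 2)^2} \cdot \frac{\|x_0 - x^\star\|^2}{s^{3/2}\mu^{1/2}k^3} \lesssim \sqrt{s\mu}\,k^\star\,\frac{\|x_0 - x^\star\|^2}{s^{3/2}\mu^{1/2}k^3} \lesssim \frac{\|x_0 - x^\star\|^2}{s^{3/2}\mu^{1/2}k^3}.$$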


Appendix E. Proof of Lemmas in Section 5 

First, we prove Lemma 11. 

Proof To begin with, note that the ODE (3) is equivalent to $d(t^3\dot X(t))/dt = -t^3\nabla f(X(t))$,
which by integration leads to

$$t^3\dot X(t) = -\frac{t^4}{4}\nabla f(x_0) - \int_0^t u^3\left(\nabla f(X(u)) - \nabla f(x_0)\right)du = -\frac{t^4}{4}\nabla f(x_0) - I(t). \qquad (48)$$

Dividing (48) by $t^4$ and applying the bound on $I(t)$, we obtain

$$\frac{\|\dot X(t)\|}{t} \le \frac{\|\nabla f(x_0)\|}{4} + \frac{\|I(t)\|}{t^4} \le \frac{\|\nabla f(x_0)\|}{4} + \frac{L M(t) t^2}{12}.$$

Note that the right-hand side of the last display is monotonically increasing in $t$. Hence, by
taking the supremum of the left-hand side over $(0, t]$, we get

$$M(t) \le \frac{\|\nabla f(x_0)\|}{4} + \frac{L M(t) t^2}{12},$$

which completes the proof by rearrangement.


Next, we prove the lemma used in the proof of Lemma 12. 




Lemma 25 The speed restarting time $T$ satisfies

$$T(x_0, f) \ge \frac{4}{5\sqrt{L}}.$$
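As a quick numerical sanity check on the threshold $4/(5\sqrt{L})$, the sketch below evaluates the scalar margin $\frac{1}{16} - \frac{Lt^2}{12(1 - Lt^2/12)} - \frac{L^2t^4}{256(1 - Lt^2/12)^2}$, whose positivity the proof below relies on, over a grid of times. The values of $L$ and the grid are arbitrary choices for illustration, and the function names are hypothetical.

```python
import math


def restart_margin(L, t):
    """1/16 - L t^2 / (12 (1 - L t^2/12)) - L^2 t^4 / (256 (1 - L t^2/12)^2)."""
    q = 1.0 - L * t * t / 12.0
    return 1.0 / 16.0 - L * t * t / (12.0 * q) - (L * t * t) ** 2 / (256.0 * q * q)


def max_admissible_time(L):
    """The lower bound 4 / (5 sqrt(L)) on the speed restarting time."""
    return 4.0 / (5.0 * math.sqrt(L))


for L in (0.1, 1.0, 50.0):
    t_max = max_admissible_time(L)
    smallest = min(restart_margin(L, t_max * j / 1000) for j in range(1, 1001))
    print(f"L = {L}: smallest margin on (0, {t_max:.4f}] is {smallest:.5f}")
```

The margin depends on $L$ and $t$ only through $Lt^2$, so the smallest value on the grid is attained at the right endpoint and is the same for every $L$; it turns negative not far beyond $4/(5\sqrt{L})$ (for instance at $t = 0.9$ when $L = 1$).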


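Proof The proof is based on studying $\langle \dot X(t), \ddot X(t)\rangle$. Dividing (48) by $t^3$, we get an expression
for $\dot X$,

$$\dot X(t) = -\frac{t}{4}\nabla f(x_0) - \frac{1}{t^3}\int_0^t u^3\left(\nabla f(X(u)) - \nabla f(x_0)\right)du. \qquad (49)$$

Differentiating the above, we also obtain an expression for $\ddot X$:

$$\ddot X(t) = -\nabla f(X(t)) + \frac{3}{4}\nabla f(x_0) + \frac{3}{t^4}\int_0^t u^3\left(\nabla f(X(u)) - \nabla f(x_0)\right)du. \qquad (50)$$

Using the two equations we can show that $d\|\dot X\|^2/dt = 2\langle \dot X(t), \ddot X(t)\rangle > 0$ for $0 < t < 4/(5\sqrt{L})$.
Continue by observing that (49) and (50) yield

$$\begin{aligned}
\langle \dot X(t), \ddot X(t)\rangle ={}& \left\langle -\frac{t}{4}\nabla f(x_0) - \frac{1}{t^3}I(t),\ -\nabla f(X(t)) + \frac{3}{4}\nabla f(x_0) + \frac{3}{t^4}I(t)\right\rangle \\
\ge{}& \frac{t}{4}\langle \nabla f(x_0), \nabla f(X(t))\rangle - \frac{3t}{16}\|\nabla f(x_0)\|^2 - \frac{1}{t^3}\|I(t)\|\left(\|\nabla f(X(t))\| + \frac{3}{2}\|\nabla f(x_0)\|\right) - \frac{3}{t^7}\|I(t)\|^2 \\
\ge{}& \frac{t}{4}\|\nabla f(x_0)\|^2 - \frac{t}{4}\|\nabla f(x_0)\|\,\|\nabla f(X(t)) - \nabla f(x_0)\| - \frac{3t}{16}\|\nabla f(x_0)\|^2 \\
&- \frac{L M(t) t^3}{12}\left(\|\nabla f(X(t)) - \nabla f(x_0)\| + \frac{5}{2}\|\nabla f(x_0)\|\right) - \frac{L^2 M(t)^2 t^5}{48} \\
\ge{}& \frac{t}{16}\|\nabla f(x_0)\|^2 - \frac{L M(t) t^3}{3}\|\nabla f(x_0)\| - \frac{L^2 M(t)^2 t^5}{16},
\end{aligned}$$

where the last step uses $\|\nabla f(X(t)) - \nabla f(x_0)\| \le L M(t) t^2/2$. To complete the proof, applying Lemma 11, the last inequality yields

$$\langle \dot X(t), \ddot X(t)\rangle \ge \left(\frac{1}{16} - \frac{L t^2}{12(1 - L t^2/12)} - \frac{L^2 t^4}{256(1 - L t^2/12)^2}\right)\|\nabla f(x_0)\|^2\, t \ge 0$$

for $t < \min\{\sqrt{12/L},\ 4/(5\sqrt{L})\} = 4/(5\sqrt{L})$, where the positivity follows from

$$\frac{1}{16} - \frac{L t^2}{12(1 - L t^2/12)} - \frac{L^2 t^4}{256(1 - L t^2/12)^2} > 0,$$

which is valid for $0 < t < 4/(5\sqrt{L})$.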